Sir:



Further to the Notice of Appeal filed May 9, 2001, and received at the Patent Office on 
May 15, 2001, herewith are three copies of Appellants' Brief on Appeal. Appellants hereby 
request a one-month extension of time in order to file this Brief. Authorized fees include the 
statutory fee of $110.00 for a one-month extension of time, as well as the $310.00 fee for the
filing of this Brief. 

This is an appeal from the decision of the Examiner finally rejecting claims 1, 2, 21, 22, 
and 27-29 of the above-identified application. 



(1) REAL PARTY IN INTEREST 
The above-identified application is assigned of record to Incyte Pharmaceuticals, Inc. 
(now Incyte Genomics, Inc.), (Reel 9403, Frame 0092) who is the real party in interest herein. 

(2) RELATED APPEALS AND INTERFERENCES
Appellants, their legal representative and the assignee are not aware of any related 
appeals or interferences which will directly affect or be directly affected by or have a bearing on 
the Board's decision in the instant appeal. 



(3) STATUS OF THE CLAIMS
Claims rejected: Claims 1, 2, 21, 22, and 27-29
Claims allowed: none
Claims canceled: Claims 3-13 and 19-20
Claims withdrawn: Claims 14-18, 23-26, and 30-32
Claims on Appeal: Claims 1, 2, 21, 22, and 27-29 (A copy of the claims on appeal, as
amended, can be found in the attached Appendix.)



(4) STATUS OF AMENDMENTS AFTER FINAL 
The Amendment after Final Rejection under 37 C.F.R. § 1.116, mailed May 9, 2001, has
been entered. See the Interview Summary of August 15, 2001. 

(5) SUMMARY OF THE INVENTION 
Appellants' invention is directed to polypeptides, human prostate growth-associated 
membrane proteins PGAMP-1 and PGAMP-2, comprising the amino acid sequences of SEQ ID 
NO:1 and SEQ ID NO:2, respectively (Specification, e.g., at page 13, line 33 to page 14, line 1;
and page 14, lines 20-21). Appellants' invention also includes polypeptides comprising a
naturally-occurring amino acid sequence having at least 90% amino acid sequence identity to
either SEQ ID NO:1 or SEQ ID NO:2 (e.g., at page 15, lines 9-11), polypeptides comprising
polypeptide fragments consisting of at least 15 contiguous amino acids of SEQ ID NO:1 (e.g., at
page 7, lines 9-12), and compositions comprising polypeptides of SEQ ID NO:1 and SEQ ID
NO:2 (e.g., at page 26, lines 17-19). 

PGAMP-1 and PGAMP-2 have strong chemical and structural homology to rat heat-stable
antigen CD4 (GenBank ID 1216498; SEQ ID NO:5), mouse apoptosis-associated tyrosine
kinase (GenBank ID 2459993; SEQ ID NO:6), and human prostate specific antigen (GenBank ID
130989; SEQ ID NO:7) (Specification, e.g., at page 14, lines 6-7; page 14, lines 30-32; and
Figures 1 and 2). In particular, PGAMP-1 and rat heat-stable antigen CD4 share 21% identity, 
PGAMP-2 and a fragment of mouse apoptosis-associated tyrosine kinase share 17% identity, and 
PGAMP-2 and human prostate specific antigen share 18% identity (e.g., at page 14, lines 7-8; 
and page 14, lines 32-33). In addition: 

1. "PGAMP-1 is 141 amino acids in length and has one potential casein kinase II
phosphorylation site at residue S35; one potential protein kinase C phosphorylation site at
residue S15; one potential tyrosine kinase phosphorylation site at residue Y110; three
potential transmembrane regions between about residues I44 to P67, I81 to W102, and
P117 to Q135; and has chemical similarity with CD44 antigen precursor. . . Northern
analysis shows the expression of this sequence in various libraries, at least 72% of which
are immortalized or cancerous and at least 18% of which involve immune response. Of
particular note is the expression of PGAMP in cancerous or hyperplastic prostate (48%)
and breast (7%); and in brain and adrenal gland." (Specification at page 14, lines 1-13)

2. "PGAMP-2 is 410 amino acids in length and has a potential N-glycosylation site at
residue N273; one potential cAMP- and cGMP-dependent protein kinase phosphorylation
site at residue S355; one potential casein kinase II phosphorylation site at residue S274;
seven potential protein kinase C phosphorylation sites at residues T118, S121, T131,
S274, S311, S366, and S378; one potential tyrosine kinase phosphorylation site at residue
Y21. In addition a hydropathy plot of PGAMP-2 predicts nine potential transmembrane
regions between about residues L16 to Y31, P37 to V49, Q51 to Q73, V76 to L92, N101
to T118, F137 to F155, I165 to P182, R230 to W251, and T400 to V410; and a potential
signal peptide sequence from M1 to S12. . . The three proteins (PGAMP-2, mouse
apoptosis-associated tyrosine kinase, and human prostate specific antigen) also share six
transmembrane regions and the potential signal peptide. In addition, PGAMP-2 and
human PSA have rather similar isoelectric points, 8.7 and 7.5, respectively. . . Northern
analysis shows the expression of this sequence in various libraries, at least 76% of which
are immortalized or cancerous and at least 18% of which involve immune response. Of
particular note is the expression of PGAMP-2 in cancerous or hyperplastic prostate (28%)
and breast (10%); and in uterus, ovary, and colon." (Specification at page 14, lines 21-33
and page 15, lines 1-8)

The polypeptides of the present invention are useful, for example, for toxicology testing, 
drug discovery, and disease diagnosis. 



(6) THE REJECTIONS 

Claims 1, 2, 21, 22, and 27-29 stand rejected under 35 U.S.C. § 101 based on the 
allegation that the claimed invention is not supported by either a specific and substantial asserted
utility or a well established utility (Office Action, Feb. 9, 2001, page 5, section 7). Claims 1, 2,
21, 22, and 27-29 also stand rejected under 35 U.S.C. § 112, first paragraph, based on the 
allegation that the specification does not enable one of skill in the art to make and/or use the 
claimed invention (Office Action, Feb. 9, 2001, page 3, section 5; and page 6, section 7). These 
rejections allege in particular that: 

• "Mere expression of PGAMP-1 or PGAMP-2 molecules in tissues does not mean 
effective treatment of disorders. There continues to be no objective evidence of record to 
show that these molecules can be used in the identification of cancerous tissue and as 
selective markers just for prostate cancer. Asserting that the claimed invention has strong 
chemical and structural homology to three other[s] does not support the molecules' utility 
nor applicability as therapeutic or pharmaceutical agents" (Addendum to Advisory 
Action, July 18, 2001, paragraph A). 

Claim 2 stands rejected under 35 U.S.C. § 112, first paragraph, based on the allegation
that the specification does not reasonably provide enablement commensurate with the scope of 
the claimed invention (Addendum to Advisory Action, July 18, 2001, paragraph B). This 
rejection alleges in particular that: 

• "Applicants have yet to identify the amino acid residues that could [be] mutated, deleted 
or substituted that could yield polypeptides that would retain the structure and function in 
the same manner as Applicants have alleged" (Addendum to Advisory Action, July 18, 
2001, paragraph B). 

Claims 1, 2, 21, 22, and 27-29 stand rejected under 35 U.S.C. § 112, first paragraph, 
based on the allegation that the claims contain subject matter which was not described in the 
specification in such a way as to enable one skilled in the art to make and/or use the invention 
(Office Action, February 9, 2001, page 3, section 5). This rejection alleges in particular that: 

• "One skilled in the art cannot rely on Appliants expressed opinions and the percentages 
recited in the specification and in Applicants' arguements of Paper 12, pages 6 and 7 as 
definitive proof that SEQ ID Nos:1 and 2 as compounds that possibly can be used as
diagnostic tools, therapeutics agents or as pharmaceutical agents. Thus, undue
experimentation would be required to use the instantly claimed polypeptides." (Office
Action, February 9, 2001, page 4).

09/397,558
Docket No.: PF-0527-1 DIV

(7) ISSUES 

1. Whether claims 1, 2, 21, 22, and 27-29 meet the utility requirement of 35 U.S.C. § 101.

2. Whether claims 1, 2, 21, 22, and 27-29 meet the enablement requirement of 35 U.S.C. 
§ 112, first paragraph, i.e., would the Specification enable one of ordinary skill in the art to make 
and use the claimed sequences, e.g., in toxicology testing, drug development, and the diagnosis 
of disease. 

3. Whether claim 2 meets the enablement requirement of 35 U.S.C. § 112, first
paragraph, with respect to "variant" polypeptide sequences. 

4. Whether claims 1, 2, 21, 22, and 27-29 meet the enablement requirement of 35 U.S.C. 
§ 112, first paragraph, with respect to whether undue experimentation would be required to use
the claimed invention. 

(8) GROUPING OF THE CLAIMS 

As to Issue 1 

All of the claims on appeal are grouped together. 
As to Issue 2 

All of the claims on appeal are grouped together. 
As to Issue 3 

Claim 2 is grouped by itself. 
As to Issue 4 

All of the claims on appeal are grouped together. 

(9) APPELLANTS' ARGUMENTS

Issue 1 - Whether the claims meet the utility requirement of 35 U.S.C. § 101 

The rejection of claims 1, 2, 21, 22, and 27-29 is improper, as the claims have a 
patentable utility as set forth in the instant specification, and/or a utility well-known to one 
of ordinary skill in the art 

The invention at issue includes polypeptide sequences expressed in human tissues, 
including reproductive, neuronal, and gastrointestinal tissues, and tissues associated with cancer 
and the immune response (Specification, e.g., at page 25, lines 15-16; and page 25, lines 20-21). 
As such, the claimed inventions have numerous practical, beneficial uses in toxicology testing, 
drug development, and the diagnosis of disease, none of which necessarily require detailed 
knowledge of how the polypeptides work. As a result of the benefits of these uses, the claimed 
inventions already enjoy significant commercial success. 

Any of these uses meets the utility requirements of 35 U.S.C. § 101 and, derivatively,
§ 112, first paragraph. Under these sections of the Patent Act, the patent applicant need only
show that the claimed invention is "practically useful," Anderson v. Natta, 480 F.2d 1392, 1397,
178 USPQ 458 (CCPA 1973), and confers a "specific benefit" on the public. Brenner v. Manson,
383 U.S. 519, 534-35, 148 USPQ 689 (1966). As discussed in a recent Court of Appeals for the
Federal Circuit case, this threshold is not high:

An invention is "useful" under section 101 if it is capable of providing some identifiable 
benefit. See Brenner v. Manson, 383 U.S. 519, 534 [148 USPQ 689] (1966); Brooktree 
Corp. v. Advanced Micro Devices, Inc., 977 F.2d 1555, 1571 [24 USPQ2d 1401] (Fed. 
Cir. 1992) ("to violate Section 101 the claimed device must be totally incapable of 
achieving a useful result"); Fuller v. Berger, 120 F. 274, 275 (7th Cir. 1903) (test for 
utility is whether invention "is incapable of serving any beneficial end"). 

Juicy Whip Inc. v. Orange Bang Inc., 51 USPQ2d 1700 (Fed. Cir. 1999). In Stiftung v. Renishaw
PLC, 945 F.2d 1173, 1180, 20 USPQ2d 1094 (Fed. Cir. 1991), the United States Court of Appeals
for the Federal Circuit explained:

An invention need not be the best or only way to accomplish a certain result, and it need 
only be useful to some extent and in certain applications: "[T]he fact that an invention 
has only limited utility and is only operable in certain applications is not grounds for 
finding lack of utility." Envirotech Corp. v. Al George, Inc., 730 F.2d 753, 762, 221 
USPQ 473, 480 (Fed. Cir. 1984). 

If persons of ordinary skill in the art would understand that there is a "well-established"
utility for the claimed invention, the threshold is met automatically and the applicant need not
make any showing to demonstrate utility. Manual of Patent Examining Procedure at
§ 706.03(a). Only if there is no "well-established" utility for the claimed invention must the
applicant demonstrate the practical benefits of the invention. Id.

Once the patent applicant identifies a specific utility, the claimed invention is presumed 
to possess it. In re Cortright, 165 F.3d 1353, 1357, 49 USPQ2d 1464 (Fed. Cir. 1999); In re Brana,
51 F.3d 1560, 1566, 34 USPQ2d 1436 (Fed. Cir. 1995). In that case the Patent Office bears the burden to
demonstrate that a person of ordinary skill in the art would reasonably doubt that the asserted
utility could be achieved by the claimed invention. Id. To do so, the PTO must provide
evidence or sound scientific reasoning. See In re Langer, 503 F.2d 1380, 1391-92, 183 USPQ
288 (CCPA 1974). If and only if the Patent Office makes such a showing, the burden shifts to 
the applicant to provide rebuttal evidence that would convince the person of ordinary skill that 
there is sufficient proof of utility. Brana, 51 F.3d at 1566. The applicant need only prove a 
"substantial likelihood" of utility; certainty is not required. Brenner, 383 U.S. at 532. 

The rejections fail to demonstrate either that the Applicants' assertions of utility are 
legally insufficient or that a person of ordinary skill in the art would reasonably doubt that they 
could be achieved. For these reasons alone the rejections should be withdrawn. 

There is, however, an additional, independent reason to overturn the rejections: to the 
extent the rejections are based on Revised Interim Utility Examination Guidelines (64 FR 71427, 
December 21, 1999), the final Utility Examination Guidelines (66 FR 1092, January 5, 2001), 
and/or the Revised Interim Utility Guidelines Training Materials (USPTO Website 
www.uspto.gov, March 1, 2000), the Guidelines and Training Materials are themselves 
inconsistent with the law. These inconsistencies are discussed separately below. 

I. Use in drug discovery as screening tools for identifying agonists and antagonists; as 
diagnostics for cancer and reproductive and immunological disorders; as controls 
for monitoring expression of polypeptides in the monitoring of disease progression; 
as well as to develop and monitor the activities of therapeutic agents, and in 
particular, for the well-known specific use in toxicological studies for new drug 
development, are sufficient utilities under 35 U.S.C. §§ 101 and 112, first paragraph. 

The claimed invention meets all of the necessary requirements for establishing a credible
utility under the Patent Law: There is a "well-established" use for the claimed invention, there 
are specific practical and beneficial uses for the invention, and those uses are substantial. 
Objective evidence, not considered by the Patent Office, further corroborates the credibility of 
the asserted utilities. 

A. The use of human polynucleotides and their encoded polypeptides as tools for 
toxicology testing, drug discovery, and the diagnosis of disease, is
"well-established."

In recent years, scientists have developed important techniques for toxicology testing, 
drug development, and disease diagnosis. Many of these techniques rely on expression profiling, 
in which the expression of numerous genes is compared in two or more samples. Genes or gene 
fragments known to be expressed are tools essential to any technology that uses expression 
profiling. Likewise, proteome expression profiling techniques have been developed in which the 
expression of numerous polypeptides is compared in two or more samples. Polypeptides known 
to be expressed, such as the invention at issue, are tools essential to any technology that uses 
proteome expression profiling. See, e.g., Sandra Steiner and N. Leigh Anderson, Expression 
profiling in toxicology — potentials and limitations, Toxicology Letters 112-113:467 (2000).

The technologies made possible by expression profiling and the DNA and polypeptide 
tools upon which they rely are now well-established. The technical literature recognizes not only 
the prevalence of these technologies, but also their unprecedented advantages in drug 
development, testing and safety assessment. One of these techniques is toxicology testing, used 
in both drug development and safety assessment. Toxicology testing is now standard practice in 
the pharmaceutical industry. See, e.g., John C. Rockett, et al., Differential gene expression in
drug metabolism and toxicology: practicalities, problems, and potential, Xenobiotica 29(7):655,
656 (1999):

Knowledge of toxin-dependent regulation in target tissues is not solely an academic 
pursuit as much interest has been generated in the pharmaceutical industry to harness this 
technology in the early identification of toxic drug candidates, thereby shortening the 
developmental process and contributing substantially to the safety assessment of new 
drugs. 

To the same effect are several other scientific publications, including Emile F. Nuwaysir, et al.,
Microarrays and Toxicology: The Advent of Toxicogenomics, Molecular Carcinogenesis 24:153 (1999);
and Sandra Steiner and N. Leigh Anderson, supra.

Nucleic acids useful for measuring the expression of whole classes of genes are routinely
incorporated for use in toxicology testing. Nuwaysir et al. describes, for example, a Human
ToxChip comprising 2089 human clones, which were selected

... for their well-documented involvement in basic cellular processes as well as their 
responses to different types of toxic insult. Included on this list are DNA replication and 
repair genes, apoptosis genes, and genes responsive to PAHs and dioxin-like compounds, 
peroxisome proliferators, estrogenic compounds, and oxidant stress. Some of the other 
categories of genes include transcription factors, oncogenes, tumor suppressor genes, 
cyclins, kinases, phosphatases, cell adhesion and motility genes, and homeobox genes. 
Also included in this group are 84 housekeeping genes, whose hybridization intensity is 
averaged and used for signal normalization of the other genes on the chip. 

See also Table 1 of Nuwaysir et al. (listing additional classes of genes deemed to be of special
interest in making a human toxicology microarray).

The more genes that are available for use in toxicology testing, the more powerful the 
technique. "Arrays are at their most powerful when they contain the entire genome of the species 
they are being used to study." John C. Rockett and David J. Dix, Application of DNA Arrays to 
Toxicology, Environ. Health Perspec. 107:681, No. 8 (1999). Control genes are carefully
selected for their stability across a large set of array experiments in order to best study the effect 
of toxicological compounds. See attached email from the primary investigator, Dr. Cynthia 
Afshari to an Incyte employee, dated July 3, 2000, as well as the original message to which she 
was responding. Thus, there is no expressed gene which is irrelevant to screening for 
toxicological effects, and all expressed genes have a utility for toxicological screening. This is 
true for both polynucleotides and the polypeptides encoded by them. 

There are numerous additional uses for the information made possible by expression 
profiling. Expression profiling is used to identify drug targets and characterize disease. See 
Rockett et al., supra. It also is used in tissue profiling, developmental biology, disease staging, 
etc. There is simply no doubt that the sequences of expressed human genes all have practical, 
substantial and credible real-world utilities, at the very least for expression profiling. 

Expression profiling technology is also used to identify drug targets and analyze disease 
at the molecular level, thus accelerating the drug development process. For example, expression 
profiling is useful for the elucidation of biochemical pathways, each pathway comprising a 
multitude of component polypeptides and thus providing a pool of potential drug targets. In this
manner, expression profiling leads to the optimization of drug target identification and a
comprehensive understanding of disease etiology and progression.

There is simply no doubt that the sequences of expressed human polynucleotides and
polypeptides all have practical, substantial and credible real-world utilities, at the very least for
biochemical pathway elucidation, drug target identification, and assessment of toxicity and
treatment efficacy in the drug development process. Sandra Steiner and N. Leigh Anderson,
supra, have elaborated on this topic as follows:

The rapid progress in genomics and proteomics technologies creates a unique 
opportunity to dramatically improve the predictive power of safety assessment and to 
accelerate the drug development process. Application of gene and protein expression 
profiling promises to improve lead selection, resulting in the development of drug 
candidates with higher efficacy and lower toxicity. The identification of biologically 
relevant surrogate markers correlated with treatment efficacy and safety bears a great 
potential to optimize the monitoring of pre-clinical and clinical trials. 

In fact, the potential benefits to the public, in terms of lives saved and reduced health care
costs, are enormous. Recent developments provide evidence that the benefits of this information
are already beginning to manifest themselves. Examples include the following:

• In 1999, CV Therapeutics, an Incyte collaborator, was able to use Incyte gene 
expression technology, information about the structure of a known transporter 
gene, and chromosomal mapping location, to identify the key gene associated 
with Tangier disease. This discovery took place over a matter of only a few
weeks, due to the power of these new genomics technologies. The discovery 
received an award from the American Heart Association as one of the top 10 
discoveries associated with heart disease research in 1999. 



• In an April 9, 2000, article published by the Bloomberg news service, an Incyte 
customer stated that it had reduced the time associated with target discovery and 
validation from 36 months to 18 months, through use of Incyte's genomic
information database. Other Incyte customers have privately reported similar 
experiences. The implications of this significant saving of time and expense for 
the number of drugs that may be developed and their cost are obvious. 



• In a February 10, 2000, article in the Wall Street Journal, one Incyte customer
stated that over 50 percent of the drug targets in its current pipeline were derived
from the Incyte database. Other Incyte customers have privately reported similar 
experiences. By doubling the number of targets available to pharmaceutical 
researchers, Incyte genomic information has demonstrably accelerated the 
development of new drugs. 

Because the Patent Examiner failed to address or consider the "well-established" utilities
for the claimed invention in toxicology testing, drug development, and the diagnosis of disease,
the Examiner's rejections should be withdrawn regardless of their merit.

B. The uses of the claimed polypeptides for toxicology testing, drug discovery,
and disease diagnosis are practical uses that confer "specific benefits" to the
public.

Even if, arguendo, toxicology testing, drug development and disease diagnosis (through 
expression profiling) are not well-established utilities (which expressly is not conceded), the 
claimed invention nonetheless has specific utility by virtue of its use in each of these techniques. 
There is no dispute that the claimed invention is in fact a useful tool in each of these techniques. 
That is sufficient to establish utility for both the polypeptides and the polynucleotides encoding 
them. 

Nevertheless, the claimed invention is rejected on the grounds that it does not have a
"specific utility" absent a detailed description of the actual function of the claimed protein or
identification of a "specific" disease it can be used to diagnose or treat. Apparently relying on
the Training Materials, the rejection is made based on a scientifically incorrect and legally
unsupportable assertion that identification of the family or families of proteins, without more,
does not satisfy the utility requirement. None of these grounds is consistent with the law.

1. A patent applicant can specify a utility without any knowledge as to 
how or why the invention has that utility. 

It is settled law that how or why any invention works is irrelevant to determining utility 
under 35 U.S.C. § 101: "[I]t is not a requirement of patentability that an inventor correctly set 
forth, or even know, how or why the invention works." In re Cortright, 165 F.3d at 1359
(quoting Newman v. Quigg, 877 F.2d 1575, 1581, 11 USPQ2d 1340 (Fed. Cir. 1989)). See also 
Fromson v. Advance Offset Plate, Inc., 720 F.2d 1565, 1570, 219 USPQ 1137 (Fed. Cir. 1983) 
("[I]t is axiomatic that an inventor need not comprehend the scientific principles on which the 
practical effectiveness of his invention rests."). It follows that the patent applicant need not set 
forth the particular functionality of the claimed invention to satisfy the utility requirement. 

Practical, beneficial use, not functionality, is at the core of the utility requirement. Supra 
(introduction to § I). So long as the practical benefits are apparent from the invention without 
speculation, the requirement is satisfied. Standard Oil Co. v. Montedison, 664 F.2d 356, 374,
212 USPQ 327 (3d Cir. 1981); see also Brana, 51 F.3d at 1565. To state that a biological molecule
might be useful to treat some unspecified disease is not, therefore, a specific utility. In re
Kirk, 376 F.2d 936, 945, 153 USPQ 48 (C.C.P.A. 1967). The molecule might be effective, and it
might not. 

However, unlike the synthetic molecules of Kirk, the claimed invention is known to be 
useful. It is not just a random sequence of speculative use. Because it is expressed in humans, a 
person of ordinary skill in the art would know how to use the claimed polypeptide sequences,
without any guesswork, in toxicology testing, drug development, and disease diagnosis
regardless of how the protein actually functions. The claimed invention could be used, for 
example, in a toxicology test to determine whether a drug or toxin causes any change in the 
expression of molecules involved in cancer, or reproductive or immunological disorders. 
Similarly, the claimed invention could be used to determine whether a specific medical 
condition, such as prostate cancer, affects the expression of PGAMP proteins, and, perhaps in 
conjunction with other information, serve as a marker for or to assess the stage of a particular 
disease or condition. 

In fact, the claimed invention could be used in toxicology testing and diagnosis without 
any knowledge (although this is not the case here) of the precise function of the protein: it could 
serve, for example, as a marker of a toxic response, or, alternatively, if levels of the claimed 
polypeptide remain unchanged during a toxic response, as a control in toxicology testing. 
Diagnosis of disease (or fingerprinting using expression profiles) can be achieved using arrays of 
numerous identifiable, expressed DNA sequences, or by two-dimensional gel analysis of the 
expressed proteins themselves, notwithstanding lack of any knowledge of the function of the 
proteins. 

The claimed polypeptides can be used in protein expression analysis techniques such as 
2-D PAGE gels and western blots. Using the claimed invention with these techniques, persons 
of ordinary skill in the art can better assess, for example, the potential toxic effect of a drug
candidate. The Patent Examiner does not dispute that the claimed polypeptide can be used in
2-D PAGE gels and western blots to perform drug toxicity testing. Instead, the Patent Examiner
contends that the claimed polypeptide cannot be useful without precise knowledge of its 
function. But the law never has required knowledge of biological function to prove utility. It is 
the claimed invention's uses, not its functions, that are the subject of a proper analysis under the
utility requirement. 

2. A patent applicant may specify a utility that applies to a broad class of 
inventions. 

The fact that the claimed invention is a member of a broad class (such as DNA sequences 
expressed in humans, or the human proteins they encode) that includes sequences other than 
those claimed that also have utilities in toxicology testing, drug discovery, disease diagnosis, etc. 
does not negate utility. Practical utilities can be directed to classes of inventions, irrespective of 
function, so long as a person of ordinary skill in the art would understand how to achieve a 
practical benefit from knowledge of the class. Montedison, 664 F.2d at 374-75. The law has 
long assumed that inventions that achieve a practical use also achieved by other inventions 
satisfy the utility requirement. For example, many materials conduct electricity. Likewise, many 
different plastics can be used to form useful films. Montedison, 664 F.2d at 374-75; Natta, 480 
F.2d at 1397. This is a general utility (practical films) that applies to a broad class of inventions 
(plastics) which satisfies the utility requirement of 35 U.S.C. § 101. 

Not all broad classes of inventions are, by themselves, sufficient to inform a person of 
ordinary skill in the art of the practical utility for a member of the class. Some classes may 
indeed convey too little information to a person of ordinary skill in the art. These may include 
classes of inventions that include both useful and nonuseful members. See In re Ziegler, 992 
F.2d 1197, 1201, 26 USPQ2d 1600 (Fed. Cir. 1993). In some of these cases, further
experimentation would be required to determine whether or not a member of the class actually has a
practical use. Brenner, 383 U.S. at 534-35. 

The broad class of steroids identified in Kirk is just such a class. It includes natural 
steroids (concededly useful) and man-made steroids, some of which are useful and some of 
which are not. Indeed, only a small fraction of the members of this broad class of invention may 
be useful. Without additional information or further experimentation, a person of ordinary skill 
in the art would not know whether a member of the class falls into the useful category or not. 
This could also be the case for the broad class of "plastic-like" polypropylenes in Ziegler, which 
includes many (perhaps predominantly) useless members.

The PTO routinely issues patents whose utility is based solely on the claimed inventions' 
membership in a class of useful things. The PTO presumably would issue a patent on a novel
and nonobvious fishing rod notwithstanding the lack of any disclosure of the particular fish it 
might be used to catch. The standard being promulgated in the Guidelines and in particular as 
exemplified in the Training Materials, and being applied in the present rejection, would appear to 
warrant a rejection, however, on the grounds that the use of the fishing rod is applicable to the 
general class of devices used to catch fish. 

The PTO must apply the same standard to the biotechnological arts that it applies to fields 
such as plastics and fishing equipment. In re Gazave, 379 F.2d 973, 977-78, 154 USPQ 92 
(CCPA 1967) quoting In re Chilowsky, 229 F.2d 457, 461, 108 USPQ 321 (CCPA 1956) ("[T]he 
same principles should apply in determining operativeness and sufficiency of disclosure in 
applications relating to nuclear fission art as in other cases."); see also In re Alappat, 33 F.3d 
1526, 1566, 31 USPQ2d 1545 (Fed. Cir. 1994) (Archer, C.J., concurring in part and dissenting in 
part) ("Discoveries and inventions in the field of digital electronics are analyzed according to the 
aforementioned principles [concerning patentable subject matter] as any other subject matter."). 
Indeed, there are numerous classes of inventions in the biotechnological arts that satisfy the 
utility requirement. 

Take, for example, the class of interleukins expressed in human cells of the immune 
system. Unlike the classes of steroids or plastic-like polypropylenes in Kirk and Ziegler, all of 
the members of this class have practical uses well beyond "throwaway" uses. All of them cause 
some physiological response (in cells of the immune system). All of the genes encoding them 
can be used for toxicology testing to generate information useful in activities such as drug 
development, even in cases where little is known as to how a particular interleukin works. No 
additional experimentation would be required, therefore, to determine whether an interleukin has 
a practical use. It is well-known to persons of ordinary skill in the art that there is no such thing 
as a useless interleukin. 

Because all of the interleukins, as a class, convey practical benefit (much like the class of 
DNA ligases identified in the Training Materials), there is no need to provide additional infor- 
mation about them. A person of ordinary skill in the art need not guess whether any given inter- 
leukin conveys a practical benefit or how that particular interleukin works. 

Another example of a class that by itself conveys practical benefits is the G protein-coupled 
receptors ("GPCRs"). GPCRs are well-known as intracellular signaling mediators with 
diverse functions critical to complex organisms. They perform these functions by binding to and 
interacting with specific ligands. They are targets of many current drug treatments, including 
anti-depressants, anti-histamines, blood pressure regulators, and opiates. 

Newly identified GPCRs are used intensively in the real world, even in cases where 
neither the specific ligand that binds to the GPCR nor the precise biological function of the GPCR 
is known. Newly identified GPCRs are used, for example, as toxicity controls for drug candi- 
dates known to bind other GPCRs. Because a person of ordinary skill in the art would know how 
to use any GPCR to achieve a practical benefit, even without any detailed or particular know- 
ledge as to how it works, GPCRs as a class meet the utility requirement. 

In fact, all isolated and purified naturally-occurring polynucleotide and polypeptide 
sequences which are expressible (i.e., which are not pseudogenes that are never expressed during 
any natural biological process) can be and are used in a real-world context as tools for 
toxicological testing, e.g., for drug discovery purposes. This utility applies to all sequences 
actually expressed, yet in each case, the utility of the sequence is quite specific, e.g., monitoring 
the expression of any particular polynucleotide or polypeptide sequence is a utility specific to 
that particular sequence. 

Prostate growth-associated membrane proteins, like interleukins, GPCRs and fishing 
rods, are a class that by itself conveys practical benefits. Unlike steroids and "plastic-like" 
polypropylenes, all of the claimed prostate growth-associated membrane proteins are expressed 
by humans, and all of them can be used as tools for toxicology testing. The claimed invention 
could be used, for example, to determine whether a drug candidate affects the expression of 
prostate growth-associated membrane proteins in humans, how it does so, and to what extent. 
Just as there are no useless interleukins and GPCRs, there are no useless prostate growth- 
associated membrane proteins of the claimed invention. As these are practical, real-world uses, 
the application need not describe particular functionality or medical applications that would only 
supplement the utilities known to exist already. 

C. Because the uses of prostate growth-associated membrane proteins in 
toxicology testing, drug discovery, and disease diagnosis are practical uses 
beyond mere study of the invention itself, the claimed invention has 
substantial utility. 

In addition to conferring a specific benefit on the public, the benefit must also be 
"substantial." Brenner, 383 U.S. at 534. A "substantial" utility is a practical, "real-world" utility. 
Nelson v. Bowler, 626 F.2d 853, 856, 206 USPQ 881 (CCPA 1980). 

The claimed invention's use as a tool for toxicology testing is just such a practical, real- 
world use. Although the rejection is not expressly based on lack of practical utility, and/or 
ignores this basis for utility, as stated it is tantamount to a rejection based on the sequence being 
only a research tool, on the ground that the use of an invention as a tool for research is not a 
"substantial" use. Because the PTO's rejection in this light assumes a substantial overstatement 
of the law, it must be withdrawn. 

There is no authority for the proposition that use as a tool for research is not a substantial 
utility. In fact, the PTO issues patents for inventions whose only use is to facilitate research, 
such as DNA ligases. These are acknowledged by the PTO's Training Materials themselves to 
be useful. 

Only a limited subset of research uses are not "substantial" utilities: those in which the 
only known use for the claimed invention is to be an object of further study, thus merely inviting 
further research. This follows from Brenner, in which the U.S. Supreme Court held that a 
process for making a compound does not confer a substantial benefit where the only known use 
of the compound was to be the object of further research to determine its use. Id. at 535. 
Similarly, in Kirk, the CCPA held that a compound would not confer substantial benefit on the 
public merely because it might be used to synthesize some other, unknown compound that would 
confer substantial benefit. Kirk, 376 F.2d at 940, 945 ("What appellants are really saying to 
those in the art is take these steroids, experiment, and find what use they do have as medicines."). 
Nowhere do those cases state or imply, however, that a material cannot be patentable if it has 
some other beneficial use in research. 

As used in toxicology testing, drug discovery, and disease diagnosis, the claimed 
invention has a beneficial use in research other than studying the claimed invention. It is a tool, 
rather than an object, of research. The claimed invention has numerous other uses as a research 
tool, each of which alone is a "substantial utility". These include diagnostic assays (pages 33- 
36), drug screening (page 38), etc. 

D. Objective evidence corroborates the utilities of the claimed invention. 
There is in fact no restriction on the kinds of evidence a Patent Examiner may consider in 
determining whether a "real-world" utility exists. Indeed, "real-world" evidence, such as 
evidence showing actual use or commercial success of the invention, can demonstrate conclusive 
proof of utility. Raytheon v. Roper, 220 USPQ 592 (Fed. Cir. 1983); Nestle v. Eugene, 55 
F.2d 854, 856, 12 USPQ 335 (6th Cir. 1932). Indeed, proof that the invention is made, used or 
sold by any person or entity other than the patentee is conclusive proof of utility. United States 
Steel Corp. v. Phillips Petroleum Co., 865 F.2d 1247, 1252, 9 USPQ2d 1461 (Fed. Cir. 1989). 

Over the past several years, a vibrant market has developed for databases containing all 
expressed genes (along with the polypeptide translations of those genes), in particular genes 
having medical and pharmaceutical significance such as the instant sequence. (Note that the 
value in these databases is enhanced by their completeness, but each sequence in them is 
independently valuable.) The databases sold by Applicants' assignee, Incyte, include exactly the 
kinds of information made possible by the claimed invention, such as tissue and disease 
associations. Incyte sells its database containing the claimed sequence and millions of other 
sequences throughout the scientific community, including to pharmaceutical companies who use 
the information to develop new pharmaceuticals. 

II. The Patent Examiner failed to demonstrate that a person of ordinary skill in the art 
would reasonably doubt the utility of the claimed invention. 

In addition to alleging a "specific" use for the claimed subject matter, a patent applicant 
must present proof that the claimed subject matter is in fact useful. Brana, 51 F.3d at 1565-66. 
The applicant need only prove a "substantial likelihood" of utility; certainty is not required. 
Brenner, 383 U.S. at 532. 

The amount of evidence required to prove utility depends on the facts of each particular 
case. In re Jolles, 628 F.2d 1322, 1326, 206 USPQ 885 (CCPA 1980). "The character and 
amount of evidence may vary, depending on whether the alleged utility appears to accord with or 
to contravene established scientific principles and beliefs." Id. Unless there is proof of "total 
incapacity," or there is a "complete absence of data" to support the applicant's assertion of utility, 
the utility requirement is met. Brooktree Corp. v. Advanced Micro Devices, Inc., 977 F.2d 1555, 
1571, 24 USPQ2d 1401 (Fed. Cir. 1992); Envirotech, 730 F.2d at 762. 
A patent applicant's assertion of utility in the disclosure is presumed to be true and 
correct. In re Cortright, 165 F.3d at 1356; Brana, 51 F.3d at 1566. If such an assertion is made, 
the Patent Office bears the burden in the first instance to demonstrate that a person of ordinary 
skill in the art would reasonably doubt that the asserted utility could be achieved. Id. To do so, 
the PTO must provide evidence or sound scientific reasoning. See Langer, 503 F.2d at 1391-92. 
If and only if the Patent Office makes such a showing, the burden shifts to the applicant to 
provide rebuttal evidence that would convince the person of ordinary skill that there is sufficient 
proof of utility. Brana, 51 F.3d at 1566. The Revised and final Utility Guidelines are in 
agreement with this procedure. See Revised and final Guidelines at ¶¶ 3-4. 

The issue of proof often arises in the chemical and biotechnological arts when the paten- 
tee asserts a utility for a claimed chemical compound based on its homology or similarity to 
another compound having a known, established utility. In such cases, the applicant can demon- 
strate "substantial likelihood" of utility by demonstrating a "reasonable correlation" between the 
utility (not the function) of the known compound and the compound being claimed. Fujikawa 
v. Wattanasin, 93 F.3d 1559, 1565, 39 USPQ2d 1895 (Fed. Cir. 1996). Accordingly, under 
Brana, the Patent Office must accept the asserted utility unless it can show that a person of 
ordinary skill in the art would reasonably doubt that a "reasonable correlation" exists. 

In the instant case, the Patent Examiner has not addressed the asserted utilities of the 
claimed polypeptides as research tools, for example, in toxicology testing, diagnostic assays, and 
drug screening. As such, the Patent Office has not met its burden to demonstrate that a person of 
ordinary skill in the art would reasonably doubt that the asserted utilities could be achieved. 
Therefore, the rejection of the claimed invention for lack of utility under 35 U.S.C. § 101 should 
be reversed. 

III. By requiring the Patent Applicant to assert a particular or unique utility, the Patent 
Examination Utility Guidelines and Training Materials applied by the Patent 
Examiner misstate the Law. 

The Training Materials, which direct the Examiners regarding how to apply the Utility 
Guidelines, address the issue of specificity with reference to two kinds of asserted utilities: 
"specific" utilities which meet the statutory requirements, and "general" utilities which do not. 
The Training Materials define a "specific utility" as follows: 
A [specific utility] is specific to the subject matter claimed. This contrasts to general 
utility that would be applicable to the broad class of invention. For example, a claim to a 
polynucleotide whose use is disclosed simply as "gene probe" or "chromosome marker" 
would not be considered to be specific in the absence of a disclosure of a specific DNA 
target. Similarly, a general statement of diagnostic utility, such as diagnosing an 
unspecified disease, would ordinarily be insufficient absent a disclosure of what condition 
can be diagnosed. 

The Training Materials distinguish between "specific" and "general" utilities by assessing 
whether the asserted utility is sufficiently "particular," i.e., unique (Training Materials at p.52) as 
compared to the "broad class of invention." (In this regard, the Training Materials appear to 
parallel the view set forth in Stephen G. Kunin, Written Description Guidelines and Utility 
Guidelines, 82 J.P.T.O.S. 77, 97 (Feb. 2000) ("With regard to the issue of specific utility the 
question to ask is whether or not a utility set forth in the specification is particular to the claimed 
invention.")). 

Such "unique" or "particular" utilities never have been required by the law. To meet the 
utility requirement, the invention need only be "practically useful," Natta, 480 F.2d at 1397, 
and confer a "specific benefit" on the public. Brenner, 383 U.S. at 534. Thus incredible, "throwaway" 
utilities, such as trying to "patent a transgenic mouse by saying it makes great snake food" 
do not meet this standard. Karen Hall, Genomic Warfare, The American Lawyer 68 (June 2000) 
(quoting John Doll, Chief of the Biotech Section of USPTO). 

This does not preclude, however, a general utility, contrary to the statement in the 
Training Materials where "specific utility" is defined (page 5). Practical real-world uses are not 
limited to uses that are unique to an invention. The law requires that the practical utility be 
"definite," not particular. Montedison, 664 F.2d at 375. Appellants are not aware of any court that 
has rejected an assertion of utility on the grounds that it is not "particular" or "unique" to the 
specific invention. Where courts have found utility to be too "general," it has been in those cases 
in which the asserted utility in the patent disclosure was not a practical use that conferred a 
specific benefit. That is, a person of ordinary skill in the art would have been left to guess as to 
how to benefit at all from the invention. In Kirk, for example, the CCPA held the assertion that a 
man-made steroid had "useful biological activity" was insufficient where there was no 
information in the specification as to how that biological activity could be practically used. Kirk, 
376 F.2d at 941. 
The fact that an invention can have a particular use does not provide a basis for requiring 
a particular use. See Brana, supra (disclosure describing a claimed antitumor compound as 
being homologous to an antitumor compound having activity against a "particular" type of 
cancer was determined to satisfy the specificity requirement). "Particularity" is not and never has 
been the sine qua non of utility; it is, at most, one of many factors to be considered. 

As described supra, broad classes of inventions can satisfy the utility requirement so long 
as a person of ordinary skill in the art would understand how to achieve a practical benefit from 
knowledge of the class. Only classes that encompass a significant portion of nonuseful members 
would fail to meet the utility requirement. Supra § I.B.2 (Montedison, 664 F.2d at 374-75). 

The Training Materials fail to distinguish between broad classes that convey information 
of practical utility and those that do not, lumping all of them into the latter, unpatentable 
category of "general" utilities. As a result, the Training Materials paint with too broad a brush. 
Rigorously applied, they would render unpatentable whole categories of inventions heretofore 
considered to be patentable, and that have indisputably benefitted the public, including the 
claimed invention. See supra § I.B. Thus the Training Materials cannot be applied consistently 
with the law. 

Issue 2 - Whether the claims meet the enablement requirement of 35 U.S.C. § 112, first 
paragraph, with respect to the "utility" issue 

To the extent the rejection of the patented invention under 35 U.S.C. § 112, first 
paragraph, is based on the improper rejection for lack of utility under 35 U.S.C. § 101, it 
must be reversed. 

The rejection set forth in the Office Action is based on the assertions discussed above, 
i.e., that the claimed invention lacks patentable utility. To the extent that the rejection under 
§ 112, first paragraph, is based on the improper allegation of lack of patentable utility under 
§ 101, it fails for the same reasons. 

Issue 3 - Whether claim 2 meets the enablement requirement of 35 U.S.C. § 112, first 
paragraph, with respect to "variant" polypeptide sequences 
Claim 2 recites polypeptides comprising naturally-occurring amino acid sequences 
having at least 90% amino acid sequence identity to either SEQ ID NO:1 or SEQ ID NO:2. This 
claim stands rejected under 35 U.S.C. § 112, first paragraph, based on the allegation that the 
specification does not reasonably provide enablement commensurate with the scope of the 
invention recited in this claim (Addendum to Advisory Action, July 18, 2001, paragraph B). The 
Patent Examiner stated that "Applicants are not entitled to all polypeptide variants that have at 
least 90% sequence identity" (Office Action, February 9, 2001, page 5, first paragraph), and that 
"Applicants have yet to identify the amino acid residues that could [be] mutated, deleted or 
substituted that could yield polypeptides that would retain the structure and function in the same 
manner as Applicants have alleged" (Addendum to Advisory Action, July 18, 2001, paragraph 
B). 

Claim 2 does not recite all polypeptide variants that have at least 90% sequence identity 
to either SEQ ID NO:1 or SEQ ID NO:2. Rather, claim 2 recites only naturally-occurring 
amino acid sequences that have at least 90% sequence identity to either SEQ ID NO:1 or SEQ ID 
NO:2. One of ordinary skill in the art would know how to obtain naturally-occurring amino acid 
sequences, and would recognize whether such sequences had at least 90% sequence identity to 
either SEQ ID NO:1 or SEQ ID NO:2. Furthermore, one of ordinary skill in the art would know 
how to use the recited sequences as, for example, research tools in toxicology testing, diagnostic 
assays, or drug screening. Such uses do not require guidance as to which amino acid residues 
could be mutated, deleted, or substituted to yield polypeptides that would retain the structure and 
function of SEQ ID NO:1 or SEQ ID NO:2, because these uses depend only on the fact that the 
recited polypeptides are expressed in a living organism. 

One of ordinary skill in the art would know how to make and use the polypeptide 
"variants" recited by claim 2. Therefore, reversal of this rejection of claim 2 under 35 U.S.C. § 
112, first paragraph, based on the alleged lack of enablement of the recited polypeptide 
"variants," is requested. 

Issue 4 - Whether claims 1, 2, 21, 22, and 27-29 meet the enablement requirement of 35 
U.S.C. § 112, first paragraph, with respect to whether undue experimentation would be 
required to use the claimed invention 

The requirements necessary to fulfill the enablement requirement of 35 U.S.C. § 112, first 
paragraph, are well established by case law. 

The test of enablement is whether one reasonably skilled in the art could make or 
use the invention from the disclosures in the patent coupled with information 
known in the art without undue experimentation. United States v. Telectronics, 
Inc., 857 F.2d 778, 785, 8 U.S.P.Q.2d 1217, 1223 (Fed. Cir. 1988). 

Likewise, the Manual of Patent Examining Procedure (MPEP) states that the enablement 
requirement "has been interpreted to require that the claimed invention be enabled so that any 
person skilled in the art can make and use the invention without undue experimentation." 
(MPEP, 7th Edition, July 1998, Section 2164.01, page 2100-145.) The MPEP further states that 
"[t]he fact that experimentation may be complex does not necessarily make it undue, if the art 
typically engages in such experimentation...[t]he test of enablement is not whether any 
experimentation is necessary, but whether, if experimentation is necessary, it is undue." (MPEP, 
7th Edition, Section 2164.01(b), page 2100-147.) 

The determination of enablement must be made based on consideration of the evidence as 
a whole, and "the evidence provided by applicant need not be conclusive but merely 
convincing to one skilled in the art." (MPEP, 7th Edition, Section 2164.05, page 2100-150.) 
(Emphasis in original.) 

Appellants respectfully submit that a proper prima facie case of lack of enablement has 
not been established. As was set forth by the Court of Appeals for the Federal Circuit in In re 
Brana, 34 U.S.P.Q.2d 1436 (C.A.F.C. 1995): 

This court's predecessor has stated: 

[A] specification disclosure which contains a teaching of the manner and process of 
making and using the invention in terms which correspond in scope to those used in 
describing and defining the subject matter sought to be patented must be taken as in 
compliance with the enabling requirement of the first paragraph of Section 112 unless 
there is reason to doubt the objective truth of the statements contained therein which must 
be relied on for enabling support. In re Marzocchi, 439 F.2d 220, 223, 169 USPQ 367, 
369 (CCPA 1971). From this it follows that the PTO has the initial burden of challenging 
a presumptively correct assertion of utility in the disclosure. Id. at 224, 169 USPQ at 
370. Only after the PTO provides evidence showing that one of ordinary skill in the art 
would reasonably doubt the asserted utility does the burden shift to the applicant to 
provide rebuttal evidence sufficient to convince such a person of the invention's asserted 
utility. See In re Bundy, 642 F.2d 430, 433, 209 USPQ 48, 51 (CCPA 1981). 
The Examiner asserted in the Office Action dated February 9, 2001 (Paper Number 13, 
page 4), that "Applicants own submission admits to the fact that these polypeptides are not 
expressed solely in cancerous tissue so the applicability of the proteins is still at question. One 
skilled in the art can not rely on Applicants expressed opinions and the percentages recited in the 
specification and in Applicants' arguements of Paper 12, pages 6 and 7 as definitive proof that 
SEQ ID Nos:1 and 2 as compounds that possibly can be used as diagnostic tools, therapeutics 
agents or as pharmaceutical agents." However, to satisfy the enablement requirement, the burden 
does not fall on the Appellants to provide definitive proof that the polypeptides of SEQ ID NO:1 
and SEQ ID NO:2 are useful as diagnostic tools, therapeutic agents, or pharmaceutical agents. 
The Appellants need only supply convincing evidence that the claimed invention is enabled by 
the specification. 

The specification clearly discloses that at least 72% of the tissue libraries in which 
PGAMP-1 was detected were cancerous or immortalized, and at least 76% of the tissue libraries 
in which PGAMP-2 was detected were cancerous or immortalized (specification, e.g., at page 14, 
lines 10-12; and page 15, lines 4-6). One of skill in the art, upon reading this disclosure in the 
context of the patent application at the time of filing, would be reasonably convinced that the 
claimed polypeptides could be used as diagnostic tools, therapeutic agents, and/or 
pharmaceutical agents. The Examiner has provided no objective evidence or sound scientific 
reasoning to show that the claimed polypeptides, which "are not expressed solely in cancerous 
tissue," could not be used in the manner asserted by the Appellants. The Examiner has also not 
provided objective evidence or sound scientific reasoning to show that expression of the 
polypeptides solely in one type of tissue is a necessary requirement for using polypeptides, such 
as those claimed, in the manner asserted by the Appellants. Therefore, the Examiner has not met 
the burden to establish a prima facie case of lack of enablement. Accordingly, the decision of the 
Examiner rejecting the claims for lack of enablement should be reversed. 

(10) CONCLUSION 

Appellants respectfully submit that rejections for lack of utility based, inter alia, on an 
allegation of "lack of specificity" as set forth in the Office Action and as justified in the Revised 
Interim and final Utility Guidelines and Training Materials, are not supported in the law. Neither 
are they scientifically correct, nor supported by any evidence or sound scientific reasoning. 
These rejections are alleged to be founded on facts in court cases such as Brenner and Kirk, yet 
those facts are clearly distinguishable from the facts of the instant application, and indeed most if 
not all nucleotide and protein sequence applications. Nevertheless, the PTO is attempting to 
mold the facts and holdings of these prior cases, "like a nose of wax," 1 to target rejections of 
claims to polypeptide and polynucleotide sequences where biological activity information has 
not been proven by laboratory experimentation, and they have done so by ignoring perfectly 
acceptable utilities fully disclosed in the specification as well as well-established utilities known 
to those of skill in the art. As is disclosed in the specification, and even more clearly, as one of 
ordinary skill in the art would understand, the claimed invention has well-established, specific, 
substantial, and credible utilities. The "utility" rejections are, therefore, improper and should be 
reversed. 

Moreover, to the extent the above rejections were based on the Revised Interim and final 
Examination Guidelines and Training Materials, those portions of the Guidelines and Training 
Materials that form the basis for the rejections should be determined to be inconsistent with the 
law. 

The enablement rejections, with respect to the "variant" polypeptide sequences of claim 
2, and the allegation that undue experimentation would be required to use the claimed invention, 
should also be reversed, based on at least the arguments presented above. 

Due to the urgency of this matter, and its economic and public health implications, an 
expedited review of this appeal is earnestly solicited. 



1 "The concept of patentable subject matter under §101 is not 'like a nose of wax which 
may be turned and twisted in any direction * * *.' White v. Dunbar, 119 U.S. 47, 51." (Parker v. 
Flook, 198 USPQ 193 (US SupCt 1978)) 
APPENDIX 

Claims on appeal: 

1. A substantially purified polypeptide comprising an amino acid sequence selected from 
the group consisting of SEQ ID NO:1 and SEQ ID NO:2. 

2. An isolated polypeptide comprising an amino acid sequence selected from the group 
consisting of: 

a) a naturally-occurring amino acid sequence having at least 90% amino acid sequence 
identity to SEQ ID NO:1, and 

b) a naturally-occurring amino acid sequence having at least 90% amino acid sequence 
identity to SEQ ID NO:2. 

21. A polypeptide of claim 1, having the amino acid sequence of SEQ ID NO:1 or SEQ 
ID NO:2. 

22. A composition comprising a polypeptide of claim 21 in conjunction with a suitable 
pharmaceutical carrier. 

27. A composition comprising a polypeptide of claim 1 in conjunction with a suitable 
pharmaceutical carrier. 

28. A substantially purified polypeptide comprising a fragment of the polypeptide of 
claim 1, wherein said fragment consists of at least 15 contiguous amino acids of SEQ ID NO:1, 
and wherein said fragment binds specifically with an anti-PGAMP-1 antibody. 

29. A composition comprising the polypeptide of claim 28 in conjunction with a suitable 
pharmaceutical carrier. 
cell's development or response and should help in the elucidation of specific and 
sensitive biomarkers representing, for example, different types of cancer or previous 
exposure to certain classes of chemicals that are enzyme inducers. 

In drug metabolism, many of the xenobiotic-metabolizing enzvmes (including 
the well-characterized isoforms of cytochrome P450) are inducible bv drugs and 
chemicals in man (Pelkonen et al. 1998). predominantly involving transcriptional 
activation of not only the cognate cytochrome P450 genes, but additional cellular 
proteins which may be crucial to the phenomenon of induction. Accordmglv the 
development of methodology to identify and assess the full complement of genes 
C/> that are either up. or down-? egulated- by inducers are crucial in the development of 

—I knowledge to understand the precise molecular mechanisms of enzvme induction 

^ and how this relates to drug action. Similarly, in the field of chemical-induced 

<C toxicity, it is now becoming increasingly obvious that most adverse reactions to 

^ drugs and chemicals are the result of multiple gene regulation, some of which are 

causal and some of which are casually-related to the toxicological phenomenon per 
" J^x ° bservsmon has led to a" upsurge in interest in gene-profiling technologies 
which differentiate between the control and toxin-treated gene pools m target tissues 
and is. therefore, of value in rationalizing the molecular mechanisms of xenobiotic- 
induced toxicity. Knowledge of toxin-dependent gene regulation m target tissues is 
not solely an academic pursuit as much interest has been generated in the 
pharmaceutical industry to harness this technology in the earlv identification of toxic 
W drug candidates, thereby shortening the developmental process and contributing 

Jg, substantially to the safety assessment of new drugs. For example, if the gene profile 

^ in response to say a testicular toxin that has been well-characterized in vivo could be 

determined in the testis, then this profile would be representative of all new drug 
candidates which act via this specific molecular mechanism of toxicity, thereby 
providing a useful and coherent approach to the early detection of such' toxicants 
Whereas it would be informative to know the identity and functionality of all genes
up/down regulated by such toxicants, this would appear a longer term goal, as the
majority of human genes have not yet been sequenced, far less their functionality
determined. However, the current use of gene profiling yields a pattern of gene
changes for a xenobiotic of unknown toxicity which may be matched to that of well-
characterized toxins, thus alerting the toxicologist to possible in vivo similarities
between the unknown and the standard, thereby providing a platform for more
extensive toxicological examination. Such approaches are beginning to gain
momentum, in that several biotechnology companies are commercially producing
'gene chips' or 'gene arrays' that may be interrogated for toxicity assessment of
xenobiotics. These chips consist of hundreds/thousands of genes, some of which are
degenerate in the sense that not all of the genes are mechanistically-related to any
one toxicological phenomenon. Whereas these chips are useful in broad-spectrum
screening, they are maturing at a substantial rate, in that gene arrays are now
becoming more specific, e.g. chips for the identification of changes in growth factor
families that contribute to the aetiology and development of chemically-induced
neoplasias.
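
The idea of matching the gene-change pattern of an unknown xenobiotic to those of
well-characterized toxins can be illustrated with a minimal sketch. Everything in it
is an assumption made for illustration: the compound names, the five-gene profiles
(expressed as hypothetical log ratios of treated versus control signal) and the choice
of Pearson correlation as the similarity measure. The review itself does not specify
how profiles should be compared.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no pattern to match
    return cov / (sd_x * sd_y)

def rank_reference_profiles(unknown, references):
    """Return (name, correlation) pairs, best-matching reference first."""
    scores = [(name, pearson(unknown, profile))
              for name, profile in references.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Hypothetical reference fingerprints for two well-characterized toxins.
REFERENCES = {
    "toxin_A": [2.0, 1.5, -1.0, 0.0, 0.5],
    "toxin_B": [-1.0, 0.0, 2.0, 1.5, -0.5],
}
# Hypothetical profile of a xenobiotic of unknown toxicity.
UNKNOWN = [1.8, 1.2, -0.8, 0.1, 0.4]

ranking = rank_reference_profiles(UNKNOWN, REFERENCES)
for name, score in ranking:
    print(f"{name}: r = {score:.3f}")
```

A high correlation with one reference would only flag a possible mechanistic
similarity; as the text notes, it serves as a starting point for more extensive
toxicological examination, not as a conclusion.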

Although documenting and explaining these genetic changes presents a
formidable obstacle to understanding the different mechanisms of development and
disease progression, the technology is now available to begin attempting this difficult
challenge. Indeed, several 'differential expression analysis' methods have been
developed which facilitate the identification of gene products that demonstrate

1. An important feature of the work of many molecular biologists is identifying which
genes are switched on and off in a cell under different environmental conditions or
subsequent to xenobiotic challenge. Such information has many uses, including the
deciphering of molecular pathways and facilitating the development of new experimental
and diagnostic procedures. However, the student of gene hunting should be forgiven for
perhaps becoming confused by the mountain of information available as there appear to be
almost as many methods of discovering differentially expressed genes as there are research
groups using the technique.

2. The aim of this review was to clarify the main methods of differential gene expression
analysis and the mechanistic principles underlying them. Also included is a discussion on
some of the practical aspects of using this technique. Emphasis is placed on the so-called
'open' systems, which require no prior knowledge of the genes contained within the study
model. Whilst these will eventually be replaced by 'closed' systems in the study of human,
mouse and other commonly studied laboratory animals, they will remain a powerful tool for
those examining less fashionable models.

3. The use of suppression-PCR subtractive hybridization is exemplified in the
identification of up- and down-regulated genes in rat liver following exposure to
phenobarbital, a well-known inducer of the drug metabolizing enzymes.

4. Differential gene display provides a coherent platform for building libraries and
microchip arrays of 'gene fingerprints' characteristic of known enzyme inducers and
xenobiotic toxicants, which may be interrogated subsequently for the identification and
characterization of xenobiotics of unknown biological properties.
Introduction

It is now apparent that the development of almost all cancers and many non-
neoplastic diseases is accompanied by altered gene expression in the affected cells
compared to their normal state (Hunter 1991, Wynford-Thomas 1991, Vogelstein
and Kinzler 1993, Semenza 1994, Cassidy 1995, Kleinjan and Van Heyningen 1998).
Such changes also occur in response to external stimuli such as pathogenic micro-
organisms (Rohn et al. 1996, Singh et al. 1997, Griffin and Krishna 1998, Lunney
1998) and xenobiotics (Sewall et al. 1995, Dogra et al. 1998, Ramana and Kohli
1998), as well as during the development of undifferentiated cells (Hecht 1998,
Rudin and Thompson 1998, Schneider-Maunoury et al. 1998). The potential
medical and therapeutic benefits of understanding the molecular changes which
occur in any given cell in progressing from the normal to the 'altered' state are
enormous. Such profiling essentially provides a 'fingerprint' of each step of a
altered expression in cells of one population compared to another. These methods
have been used to identify differential gene expression in many situations, including
invading pathogenic microbes (Zhao et al. 1998), in cells responding to extracellular
and intracellular microbial invasion (Duguid and Dinauer 1990, Ragno et al. 1997,
Maldarelli et al. 1998), in chemically treated cells (Syed et al. 1997, Rockett et al.
1999), neoplastic cells (Liang et al. 1992, Chang and Terzaghi-Howe 1998),
activated cells (Gurskaya et al. 1996, Wan et al. 1996), differentiated cells (Hara et
al. 1991, Guimaraes et al. 1995a, b), and different cell types (Davis et al. 1984,
Hedrick et al. 1984, Zhu et al. 1998). Although differential expression analysis
technologies are applicable to a broad range of models, perhaps their most important
advantage is that, in most cases, absolutely no prior knowledge of the specific genes
which are up- or down-regulated is required.

The field of differential expression analysis is a large and complex one, with
many techniques available to the potential user. These can be categorized into
several methodological approaches, including:

(1) Differential screening, 

(2) Subtractive hybridization (SH) (includes methods such as chemical cross-linking
subtraction (CCLS), suppression-PCR subtractive hybridization (SSH), and
representational difference analysis (RDA)),

(3) Differential display (DD), 

(4) Restriction endonuclease facilitated analysis (including serial analysis of gene 
expression — SAGE — and gene expression fingerprinting — GEF), 

(5) Gene expression arrays, and 

(6) Expressed sequence tag (EST) analysis. 

The above approaches have been used successfully to isolate differentially 
expressed genes in different model systems. However, each method has its own 
subtle (and sometimes not so subtle) characteristics which incur various advantages 
and disadvantages. Accordingly, it is the purpose of this review to clarify the 
mechanistic principles underlying the main differential expression methods and to 
highlight some of the broader considerations and implications of this very powerful 
and increasingly popular technique. Specifically, we will concentrate on the so-called
'open' systems, namely those which do not require any knowledge of gene
sequences and, therefore, are useful for isolating unknown genes. Two 'closed'
systems (those utilising previously identified gene sequences), EST analysis and the
use of DNA arrays, will also be considered briefly for completeness. Whilst
emphasis will often be placed on suppression PCR subtractive hybridization (SSH,
the approach employed in this laboratory), it is the aim of the authors to highlight,
wherever possible, those areas of common interest to those who use, or intend to use,
differential gene expression analysis.



Differential cDNA library screening (DS)

Despite the development of multiple technological advances which have recently 
brought the field of gene expression profiling to the forefront of molecular analysis, 
recognition of the importance of differential gene expression and characterization of 
differentially expressed genes has existed for many years. One of the original 
approaches used to identify such genes was described 20 years ago by St John and 
Davis (1979). These authors developed a method, termed 'differential plaque filter
hybridization', which was used to isolate galactose-inducible DNA sequences from
yeast. The theory is simple: a genomic DNA library is prepared from normal,
unstimulated cells of the test organism/tissue and multiple filter replicas are
prepared. These replica blots are probed with radioactively (or otherwise) labelled
complex cDNA probes prepared from the control and test cell mRNA populations.
Those mRNAs which are differentially expressed in the treated cell population will 
show a positive signal only on the filter probed with cDNA from the treated cells. 
Furthermore, labelled cDNA from different test conditions can be used to probe 
multiple blots, thereby enabling the identification of mRNAs which are only up- 
regulated under certain conditions. For example, St John and Davis (1979) screened 
replica filters with acetate-, glucose- and galactose-derived probes in order to obtain 
genes induced specifically by galactose metabolism. Although groundbreaking in its
time, this method is now considered insensitive and time-consuming, as up to 2
months are required to complete the identification of genes which are differentially 
expressed in the test population. In addition, there is no convenient way to check 
that the procedure has worked until the whole process has been completed. 

Subtractive Hybridization (SH) 

The developing concept of differential gene expression and the success of early 
approaches such as that described by St John and Davis (1979) soon gave rise to a 
search for more convenient methods of analysis. One of the first to be developed was 
SH, numerous variations of which have since been reported (see below). In general, 
this approach involves hybridization of mRNA/cDNA from one population (tester)
to excess mRNA/cDNA from another (driver), followed by separation of the 
unhybridized tester fraction (differentially expressed) from the hybridized common 
sequences. This step has been achieved physically, chemically and through the use 
of selective polymerase chain reaction (PCR) techniques. 

Physical separation 

Original subtractive hybridization technology involved the physical separation 
of hybridized common species from unique single stranded species. Several methods 
of achieving this have been described, including hydroxyapatite chromatography
(Sargent and Dawid 1983), avidin-biotin technology (Duguid and Dinauer 1990)
and oligodT-latex separation (Hara et al. 1991). In the first approach, common 
mRNA species are removed by cDNA (from test cells)-mRNA (from control cells) 
subtractive hybridization followed by hydroxyapatite chromatography, as hydroxy- 
apatite specifically adsorbs the cDNA-mRNA hybrids. The unabsorbed cDNA is 
then used either for the construction of a cDNA library of differentially expressed 
genes (Sargent and Dawid 1983, Schneider et al. 1988) or directly as a probe to
screen a preselected library (Zimmerman et al. 1980, Davis et al. 1984, Hedrick et al. 
1984). A schematic diagram of the procedure is shown in figure 1. 

Less rigorous physical separation procedures coupled with sensitivity enhancing 
PCR steps were later developed as a means to overcome some of the problems 
encountered with the hydroxyapatite procedure. For example, Duguid and Dinauer
(1990) described a method of subtraction utilizing biotin-affinity systems as a means
to remove hybridized common sequences. In this process, both the control and
tester mRNA populations are first converted to cDNA and an adaptor ('oligovector'
Figure 2. The use of oligodT-latex to perform subtractive hybridization. mRNA extracted from the
control (driver) population is converted to anchored cDNA using polydT oligonucleotides
attached to latex beads. mRNA from the treated/altered (tester) population is repeatedly
hybridized against an excess of the anchored driver cDNA. The final population of mRNA is
tester specific and can be converted into cDNA for cloning and other downstream applications, as
described by Hara et al. (1991).
at the 5' end, the final pool of random cDNA fragments is a PCR-renewable cDNA
population which is representative of the expressed gene pool and can be used to
synthesize sense RNA for use as driver material. Furthermore, if the final pool of
random cDNA fragments is reamplified using biotinylated T7 primer and random
hexamer, the product can be captured with streptavidin beads and the antisense
strand eluted for use as tester. Since both target and driver can be generated from
the same DROP product, subtraction can be performed in both directions (i.e. for
up- and down-regulated species) between two different DROP products.

Representational Difference Analysis (RDA) 

RDA of cDNA (Hubank and Schatz 1994) is an extension of the technique 
originally applied to genomic DNA as a means of identifying differences between 
two complex genomes (Lisitsyn et al. 1993). It is a process of subtraction and 
amplification involving subtractive hybridization of the tester in the presence of 
excess driver. Sequences in the tester that have homologues in the driver are 
rendered unamplifiable, whereas those genes expressed only in the tester retain the 
ability to be amplified by PCR. The procedure is shown schematically in figure 4. 

In essence, the driver and tester mRNA populations are first converted to cDNA
and amplified by PCR following the ligation of an adaptor. The adaptors are then
removed from both populations and a new (different) adaptor ligated to the
amplified tester population only. Driver and tester populations are next melted and
hybridized together in a ratio of 100:1. Following hybridization, only tester:tester
homohybrids have 5' adaptors at each end of the DNA duplex and can, thus, be filled
in at both 3' ends. Hence, only these molecules are amplified exponentially during
the subsequent PCR step. Although tester:driver heterohybrids are present, they
only amplify in a linear fashion, since the strand derived from the driver has no
adaptor to which the primer can bind. Driver:driver homohybrids have no
adaptors and, therefore, are not amplified. Single stranded molecules are digested
with mung bean nuclease before a further PCR-enrichment of the tester:tester
homohybrids. The adaptors on the amplified tester population are then replaced and
the whole process repeated a further two or three times using an increasing excess of
driver (Hubank and Schatz used a tester:driver ratio of 1:400, 1:80000 and
1:800000 for the second, third and fourth hybridizations, respectively). Different
adaptors are ligated to the tester between successive rounds of hybridization and
amplification to prevent the accumulation of PCR products that might interfere with
subsequent amplifications. The final display is a series of differentially expressed
gene products easily observable on an ethidium bromide gel.
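
The contrast between exponential and linear amplification described above can be
made concrete with a small numerical sketch. It assumes an idealized PCR with
perfect efficiency in every cycle, which real reactions do not achieve; the function
name and the cycle number are illustrative choices and are not taken from Hubank
and Schatz (1994).

```python
def copies_after_pcr(initial_copies, cycles, hybrid_type):
    """Idealized copy number of one hybrid class after a number of PCR cycles.

    tester:tester  both strands carry the adaptor, so the copy number
                   doubles every cycle (exponential amplification).
    tester:driver  only the tester strand can be primed, so one new strand
                   is made per template per cycle (linear amplification).
    driver:driver  no adaptor on either strand, so nothing is amplified.
    """
    if hybrid_type == "tester:tester":
        return initial_copies * 2 ** cycles
    if hybrid_type == "tester:driver":
        return initial_copies * (1 + cycles)
    if hybrid_type == "driver:driver":
        return initial_copies
    raise ValueError(f"unknown hybrid type: {hybrid_type}")

CYCLES = 10  # illustrative cycle number
for kind in ("tester:tester", "tester:driver", "driver:driver"):
    print(kind, copies_after_pcr(1, CYCLES, kind))
```

Under these assumptions a single tester:tester duplex gives 1024 copies after 10
cycles, against 11 for a tester:driver hybrid, which is why the tester-specific
sequences dominate the final display.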

The main advantages of RDA are that it offers a reproducible and sensitive 
approach to the analysis of differentially expressed genes. Hubank and Schatz (1994) 
reported that they were able to isolate genes that were differentially expressed in 
substantially less than 1 % of the cells from which the tester is derived. Perhaps the 
main drawback is that multiple rounds of ligation, hybridization, amplification and
digestion are required. The procedure is, therefore, lengthier than many other
differential display approaches and provides more opportunity for operator-induced
error to occur. Although the generation of false positives has been noted, this has
been solved to some degree by O'Neill and Sinclair (1997) through the use of HPLC-
purified adaptors. These are free of the truncated adaptors which appear to be a
major source of the false positive bands. A very similar technique to RDA, termed
linker capture subtraction (LCS), was described by Yang and Sytowski (1996).
Figure 4. The representational difference analysis (RDA) technique. Driver and tester cDNA are
digested with a 4-cutter restriction enzyme such as DpnII. The 1st set of 12/24 adaptor strands
(oligonucleotides) are ligated to each other and the digested cDNA products. The 12mer is
subsequently melted away and the 3' ends filled in using Taq DNA polymerase. Each cDNA
population is then amplified using PCR, following which the 1st set of adaptors is removed with
DpnII. A second set of 12/24 adaptor strands is then added to the amplified tester cDNA
population, after which the tester is hybridised against a large excess of driver. The 12mer
adaptors are melted and the 3' ends filled in as before. PCR is carried out with primers identical
to the new 24mer adaptor. Thus, the only hybridization products which are exponentially
amplified are those which are tester:tester combinations. Following PCR, ssDNA products are
removed with mung bean nuclease, leaving the 'first difference product'. This is digested and a
third set of 12/24 adaptors added before repeating the subtraction process from the hybridization
stage. The process is repeated to the 3rd or 4th difference product, as described by Lisitsyn et al.
(1993) and Hubank and Schatz (1994).
Suppression PCR Subtractive Hybridization (SSH)

The most recent adaptation of the SH approach to differential expression
analysis was first described by Diatchenko et al. (1996) and Gurskaya et al. (1996).
They reported that a 1000-5000 fold enrichment of rare cDNAs (equivalent to
isolating mRNAs present at only a few copies per cell) can be obtained without the
need for multiple hybridizations/subtractions. Instead of physical or chemical
removal of the common sequences, a PCR-based suppression system is used (see
figure 5).

In SSH, excess driver cDNA is added to two portions of the tester cDNA which
have been ligated with different adaptors. A first round of hybridization serves to
enrich differentially expressed genes and equalize rare and abundant messages.
Equalization occurs since reannealing is more rapid for abundant molecules than for
rarer molecules due to the second order kinetics of hybridization (James and Higgins
1985). The two primary hybridization mixes are then mixed together in the presence
of excess driver and allowed to hybridize further. This step permits the annealing of
single stranded complementary sequences which did not hybridize in the primary
hybridization, and in doing so generates templates for PCR amplification. Although
there are several possible combinations of the single stranded molecules present in
the secondary hybridization mix, only one particular combination (differentially
expressed in the tester cDNA composed of complementary strands having different
adaptors) can amplify exponentially.
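
The equalization effect attributed to second order kinetics can be sketched with
the textbook rate law for ideal reannealing, C(t) = C0 / (1 + k C0 t), where C0 is
the starting single stranded concentration. This is a simplified model offered as an
illustration: the rate constant, time and concentrations below are arbitrary,
hypothetical values, and the model ignores the driver and the adaptors used in the
actual SSH procedure.

```python
def single_stranded_remaining(c0, k, t):
    """Concentration still single stranded after time t under ideal
    second-order reannealing: C(t) = C0 / (1 + k * C0 * t)."""
    return c0 / (1.0 + k * c0 * t)

K = 1.0  # hypothetical rate constant (arbitrary units)
T = 1.0  # hypothetical hybridization time (arbitrary units)

abundant_start, rare_start = 100.0, 1.0  # a 100-fold difference in abundance
abundant_left = single_stranded_remaining(abundant_start, K, T)
rare_left = single_stranded_remaining(rare_start, K, T)

ratio_before = abundant_start / rare_start
ratio_after = abundant_left / rare_left
print(f"abundant:rare before hybridization = {ratio_before:.0f}:1")
print(f"abundant:rare still single stranded = {ratio_after:.2f}:1")
```

With these illustrative numbers the 100-fold difference between the two species
shrinks to roughly 2-fold in the single stranded fraction, because the abundant
species reanneals much faster. The text's caveat still applies: in practice
equalization is by no means perfectly accomplished.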

Having obtained the final differential display, two options are available if cloning 
of cDNAs is desired. One is to transform the whole of the final PCR reaction into 
competent cells. Transformed colonies can then be isolated and their inserts 
characterized by sequencing, restriction analysis or PCR. Alternatively, the final 
PCR products can be resolved on a gel and the individual bands excised, reamplified 
and cloned. The first approach is technically simpler and less time consuming. 
However, ligation /transformation reactions are known to be biased towards the 
cloning of smaller molecules, and so the final population of clones will probably not 
contain a representative selection of the larger products. In addition, although 
equalization theoretically occurs, observations in this laboratory suggest that this is
by no means perfectly accomplished. Consequently, some gene species are present
in a higher number than others and this will be represented in the final population
of clones. Thus, in order to obtain a substantial proportion of those gene species that
actually demonstrate differential expression in the tester population, the number of
clones that will have to be screened after this step may be substantial. The second
approach is initially more time consuming and technically demanding. However, it
would appear to offer better prospects for cloning larger and low abundance gel
products. In addition, one can incorporate a screening step that differentiates
different products of different sequences but of the same size (HA-staining, see 
later). In this way, a good idea of the final number of clones to be isolated and 
identified can be achieved. 

An alternative (or even complementary) approach is to use the final differential
display reaction to screen a cDNA library to isolate full length clones for further
characterization, or a DNA array (see later) to quickly identify known genes. SSH
has been used in this laboratory to begin characterization of the short-term gene
expression profiles of enzyme-inducers such as phenobarbital (Rockett et al. 1997)
and Wy-14,643 (Rockett et al. unpublished observations). The isolation of
differentially expressed genes in this manner enables the construction of a fingerprint 
Figure 5. PCR-select cDNA subtraction. In the primary hybridization, an excess of driver cDNA is
added to each tester cDNA population. The samples are heat denatured and allowed to hybridize
for between 3 and 8 h. This serves two purposes: (1) to equalize rare and abundant molecules; and
(2) to enrich for differentially expressed sequences, since cDNAs that are not differentially expressed
form type c molecules with the driver. In the secondary hybridization, the two primary
hybridizations are mixed together without denaturing. Fresh denatured driver can also be added
at this point to allow further enrichment of differentially expressed sequences. Type e molecules
are formed in this secondary hybridization which are subsequently amplified using two rounds of
PCR. The final products can be visualized on an agarose gel, labelled directly or cloned into a
vector for downstream manipulation. As described by Diatchenko et al. (1996) and Gurskaya
et al. (1996), with permission.
Figure 6. Flow diagram showing method used in this laboratory to isolate and identify clones of genes
which are differentially expressed in rat liver following short term exposure to the enzyme
inducers, phenobarbital and Wy-14,643.



of expressed genes which are unique to each compound and time/dose point. Such 
information could be useful in short-term characterization of the toxic potential of 
new compounds by comparing the gene-expression profiles they elicit with those 
produced by known inducers. Figure 6 shows a flow diagram of the method used to 
isolate, verify and clone differentially expressed genes, and figure 7 shows expression 
profiles obtained from a typical SSH experiment. Subsequent sub-cloning of the 
individual bands, sequencing and gene data base interrogation reveals many genes 
which are either up- or down-regulated by phenobarbital in the rat (tables 2 and 3).

One of the advantages in using the SSH approach is that no prior knowledge is
required of which specific genes are up/down-regulated subsequent to xenobiotic
Figure 7. SSH display patterns obtained from rat liver following 3-day treatment with Wy-14,643 or
phenobarbital. mRNA extracted from control and treated livers was used to generate the
differential displays using the PCR-Select cDNA subtraction kit (Clontech). Lane: 1, 1 kb
ladder; 2, genes upregulated following Wy-14,643 treatment; 3, genes downregulated following
Wy-14,643 treatment; 4, genes upregulated following phenobarbital treatment; 5, genes
downregulated following phenobarbital treatment; 6, 1 kb ladder. Reproduced from Rockett et
al. (1997), with permission.

exposure, and an almost complete complement of genes is obtained. For example,
the peroxisome proliferator and non-genotoxic hepatocarcinogen Wy-14,643 up-
regulates at least 28 genes and down-regulates at least 15 in the rat (a sensitive
species) and produces 48 up- and 37 down-regulated genes in the guinea pig, a
resistant species (Rockett, Swales, Esdaile and Gibson, unpublished observations).
One of these genes, CD81, was up-regulated in the rat and down-regulated in the
guinea pig following Wy-14,643 treatment. CD81 (alternatively named TAPA-1) is
a widely expressed cell surface protein which is involved in a large number of cellular
processes including adhesion, activation, proliferation and differentiation (Levy et
al. 1998). Since all of these functions are altered to some extent in the phenomena
of hepatomegaly and non-genotoxic hepatocarcinogenesis, it is intriguing, and
probably mechanistically-relevant, that CD81 expression is differentially regulated
in a resistant and susceptible species. However, the down-side of this approach is
that the majority of genes can be sequenced and matched to database sequences, but
the latter are predominantly expressed sequence tags or genes of completely
unknown function, thus partially obscuring a realistic overall assessment of the
critical genes of genuine biological interest. Notwithstanding the lack of complete
functional identification of altered gene expression, such gene profiling studies
essentially provide a 'molecular fingerprint' in response to xenobiotic challenge,
thereby serving as a mechanistically-relevant platform for further detailed
investigations.



Differential Display (DD)

Originally described as 'RNA fingerprinting by arbitrarily primed PCR' (Liang
and Pardee 1992), this method is now more commonly referred to as 'differential
Table 2. Genes up-regulated in rat liver following 3-day exposure to phenobarbital.

Band number                 Highest sequence     FASTA-EMBL gene identification
(approximate size in bp)    similarity

5 (1300)                    93.5%                CYP2B1
7 (1000)                    95.1%                Preproalbumin
                                                 Serum albumin mRNA
8 (950)                     98.3%                NCI-CGAP-Pr1 H. sapiens (EST)
10 (850)                    95.7%                CYP2B1
11 (800)                    Clone 1  94.9%       CYP2B1
                            Clone 2  75.3%       CYP2B2
12 (750)                    93.8%                TRPM-2 mRNA
                                                 Sulfated glycoprotein
15 (600)                    92.9%                Preproalbumin
                                                 Serum albumin mRNA
16 (55)                     Clone 1  95.2%       CYP2B1
                            Clone 2  93.6%       Haptoglobulin mRNA partial alpha
21 (350)                    99.3%                18S, 5.8S & 28S rRNA

Bands 1-4, 6, 9, 13, 14, and 17-20 are shown to be false positives by dot blot analysis and, therefore,
are not sequenced. Derived from Rockett et al. (1997). It should be noted that the above genes do not
represent the complete spectrum of genes which are up-regulated in rat liver by phenobarbital, but
simply represents the genes sequenced and identified to date.



Table 3. Genes down-regulated in rat liver following 3-day exposure to phenobarbital.

Band number                 Highest sequence     FASTA-EMBL gene identification
(approximate size in bp)    similarity

1 (1500)                    95.3%                3-oxoacyl-CoA thiolase
2 (1200)                    92.3%                Hemopexin mRNA
3 (1000)                    91.7%                Alpha-2u-globulin mRNA
7 (700)                     Clone 1  77.2%       M. musculus C1 inhibitor
                            Clone 2  94.5%       Electron transfer flavoprotein
                            Clone 3  91.0%       M. musculus Topoisomerase 1 (Topo 1)
8 (650)                     Clone 1  86.9%       Soares 2NbMT M. musculus (EST)
                            Clone 2  96.2%       Alpha-2u-globulin (s-type) mRNA
9 (600)                     Clone 1  86.9%       Soares mouse NML M. musculus (EST)
                            Clone 2  82.0%       Soares p3NMF19.5 M. musculus (EST)
10 (550)                    73.8%                Soares mouse NML M. musculus (EST)
11 (525)                    95.7%                NCI-CGAP-Pr1 H. sapiens (EST)
12 (375)                    100.0%               Ribosomal protein
13 (23)                     Clone 1  97.2%       Soares mouse embryo NbME13.5 (EST)
                            Clone 2  100.0%      Fibrinogen B-beta-chain
                            Clone 3  100.0%      Apolipoprotein E gene
14 (170)                    96.0%                Soares p3NMF19.5 M. musculus (EST)
15 (140)                    97.3%                Stratagene mouse testis (EST)
Others: (300)               96.7%                R. norvegicus RASP 1 mRNA
        (275)               93.1%                Soares mouse mammary gland (EST)

EST = Expressed sequence tag. Bands 4-6 were shown to be false positives by dot blot analysis and,
therefore, were not sequenced. Derived from Rockett et al. (1997). It should be noted that the above genes
do not represent the complete spectrum of genes which are down-regulated in rat liver by phenobarbital,
but simply represents the genes sequenced and identified to date.



display' (DD). In this method, all the mRNA species in the control and treated cell
populations are amplified in separate reactions using reverse transcriptase-PCR
(RT-PCR). The products are then run side-by-side on sequencing gels. Those
bands which are present in one display only, or which are much more intense in one
display compared to the other, are differentially expressed and may be recovered for
further characterization. One advantage of this system is the speed with which it can 
be carried out — 2 days to obtain a display and as little as a week to make and identify 
clones. 

Two commonly used variations are based on different methods of priming the 
reverse transcription step (figure 8). One is to use an oligo dT with a 2-base 'anchor'
at the 3'-end, e.g. 5'-(dT11)CA-3' (Liang and Pardee 1992). Alternatively, an
arbitrary primer may be used for 1st strand cDNA synthesis (Welsh et al. 1992).
This variant of RNA fingerprinting has also been called RAP (RNA Arbitrarily
Primed)-PCR. One advantage of this second approach is that PCR products may be
derived from anywhere in the RNA, including open reading frames. In addition, it 
can be used for mRNAs that are not polyadenylated, such as many bacterial mRNAs 
(Wong and McClelland 1994). In both cases, following reverse transcription and 
denaturation, second strand cDNA synthesis is carried out with an arbitrary primer 
(arbitrary primers have a single base at each position, as compared to random 
primers, which contain a mixture of all four bases at each position). The resulting 
PCR, thus, produces a series of products which, depending on the system (primer 
length and composition, polymerase and gel system), usually includes 50-100 
products per primer set (Band and Sager 1989). When a combination of different 
dT-anchors and arbitrary primers are used, almost all mRNA species from a cell can
be amplified. When the cDNA products from two different populations are analysed 
side by side on a polyacrylamide gel, differences in expression can be identified and 
the appropriate bands recovered for cloning and further analysis. 
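
The way a small set of anchored oligo dT primers divides the polyadenylated
mRNA population into subsets can be sketched by enumerating the primers. This is
an illustration under stated assumptions: the anchor bases are restricted to G, C and
A at both positions, following the figure 8 caption, and the dT length of 11 follows
the (dT11)CA example in the text. Published protocols differ in which bases they
allow at each anchor position, so the alphabets are left as parameters.

```python
from itertools import product

def anchored_primers(dt_length=11, first_bases="GCA", second_bases="GCA"):
    """List oligo dT primers carrying a 2-base anchor at the 3' end.

    Sequences are written 5' to 3'. The default anchor alphabets are an
    assumption taken from the figure 8 caption (N = G, C or A).
    """
    tail = "T" * dt_length
    return [tail + a + b for a, b in product(first_bases, second_bases)]

primers = anchored_primers()
print(len(primers), "anchored primers")
for primer in primers:
    print(primer)
```

Each anchored primer selects the mRNAs whose two bases immediately upstream
of the poly(A) tail are complementary to its anchor, so pairing every anchored
primer with several arbitrary primers is what allows most mRNA species to be
sampled across a set of displays.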

Although DD is perhaps the most popular approach used today for identifying 
differentially expressed genes, it does suffer from several perceived disadvantages : 

(1) It may have a strong bias towards high copy number mRNAs (Bertioli et al.
1995), although this has been disputed (Wan et al. 1996) and the isolation of very
low abundance genes may be achieved in certain circumstances (Guimaraes et
al. 1995a).

(2) The cDNAs obtained often only represent the extreme 3' end of the mRNA
(often the 3'-untranslated region), although this may not always be the case
(Guimaraes et al. 1995a). Since the 3' end is often not included in Genbank and
shows variation between organisms, cDNAs identified by DD cannot always be
matched with their genes, even if they have been identified.

(3) The pattern of differential expression seen on the display often cannot be
reproduced on Northern blots, with false positives arising in up to 70% of cases
(Sun et al. 1994). Some adaptations have been shown to reduce false positives,
including the use of two reverse transcriptases (Sung and Denman 1997),
comparison of uninduced and induced cells over a time course (Burn et al. 1994)
and comparison of DDPCR-products from two uninduced and two induced
lines (Sompayrac et al. 1995). The latter authors also reported that the use of
cytoplasmic RNA rather than total RNA reduces false positives arising from
nuclear RNA that is not transported to the cytoplasm.

Further details of the background, strengths and weaknesses of the DD 
technique can be obtained from a review by McClelland et al. (1996) and from
articles by Liang et al. (1995) and Wan et al. (1996).
Figure 8. Two approaches to differential display (DD) analysis. 1st strand synthesis can be carried out
either with a polydT-NN primer (where N = G, C or A) or with an arbitrary primer. The use of
different combinations of G, C and A to anchor the first strand polydT primer enables the priming
of the majority of polyadenylated mRNAs. Arbitrary primers may hybridize at none, one or more
places along the length of the mRNA, allowing 1st strand cDNA synthesis to occur at none, one
or more points in the same gene. In both cases, 2nd strand synthesis is carried out with an arbitrary
primer. Since these arbitrary primers for the 2nd strand may also hybridize to the 1st strand cDNA
in a number of different places, several different 2nd strand products may be obtained from one
binding point of the 1st strand primer. Following 2nd strand synthesis, the original set of primers
is used to amplify the second strand products, with the result that numerous gene sequences are
amplified.



Restriction endonuclease-facilitated analysis of gene expression 

Serial Analysis of Gene Expression (SAGE) 

A more recent development in the field of differential display is SAGE analysis 
(Velculescu et al. 1995). This method uses a different approach to those discussed so
far and is based on two principles. Firstly, in more than 95% of cases, short
nucleotide sequences ('tags') of only nine or 10 base pairs provide sufficient
information to identify their gene of origin. Secondly, concatenation (linking
together in a series) of these tags allows sequencing of multiple cDNAs within a
single clone. Figure 9 shows a schematic representation of the SAGE process. In this
procedure, double stranded cDNA from the test cells is synthesized with a
biotinylated polydT primer. Following digestion with a commonly cutting (4 bp
recognition sequence) restriction enzyme ('anchoring enzyme'), the 3' ends of the
cDNA population are captured with streptavidin beads. The captured population is
split into two and different adaptors ligated to the 5' ends of each group. Incorporated
into the adaptors is a recognition sequence for a type IIS restriction enzyme, one
which cuts DNA at a defined distance (< 20 bp) from its recognition sequence.
Hence, following digestion of each captured cDNA population with the IIS enzyme,
the adaptors plus a short piece of the captured cDNA are released. The two
populations are then ligated and the products amplified. The amplified products are
cleaved with the original anchoring enzyme, religated (concatemers are formed in
the process) and cloned. The advantage of this system is that hundreds of gene tags 
can be identified by sequencing only a few clones. Furthermore, the number of times 
a given transcript is identified is a quantitative measurement of that gene's 
abundance in the original population, a feature which facilitates identification of 
differentially expressed genes in different cell populations. 
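
The counting step described above can be sketched in a few lines. This is only an illustration under stated assumptions: ditags in a sequenced concatemer are taken to be separated by the anchoring-enzyme site CATG (the site drawn in Figure 9), each ditag is assumed to consist of two tags of a fixed length (9 bp here) joined tail to tail, and the function names are hypothetical. A practical implementation would also need to cope with sequencing errors, variable ditag lengths and tags that themselves contain the anchoring site.

```python
from collections import Counter

ANCHOR = "CATG"  # anchoring-enzyme site as drawn in Figure 9
TAG_LEN = 9      # assumed tag length (the text quotes nine or 10 bp)

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def tags_from_concatemer(concatemer):
    """Split a concatemer at the anchoring site and return the tags it holds.

    Each ditag is read as tag 1 (left half, as sequenced) followed by
    tag 2 (right half, which lies on the opposite strand).
    """
    tags = []
    for ditag in concatemer.split(ANCHOR):
        if len(ditag) != 2 * TAG_LEN:
            continue  # skip empty ends and anything that is not a clean ditag
        tags.append(ditag[:TAG_LEN])
        tags.append(reverse_complement(ditag[TAG_LEN:]))
    return tags


def tag_counts(concatemers):
    """Count how often each tag occurs across all sequenced concatemers."""
    counts = Counter()
    for concatemer in concatemers:
        counts.update(tags_from_concatemer(concatemer))
    return counts
```

The resulting counts stand in for the relative abundance of each transcript, so two cell populations can be compared tag by tag.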

Some disadvantages of SAGE analysis are that the method is technically difficult,
requires a large amount of accurate sequencing, is biased towards abundant
mRNAs, has not been validated in the pharmaco/toxicogenomic setting and has
only been used to examine well known tissue differences to date.



Gene Expression Fingerprinting (GEF)

A different capture/restriction digest approach for isolating differentially 
expressed genes has been described by Ivanova and Belyavsky (1995). In this 
method, RNA is converted to cDNA using biotinylated oligo(dT) primers. The
cDNA population is then digested with a specific endonuclease and captured with
magnetic streptavidin microbeads to facilitate removal of the unwanted 5' digestion
products. The use of restricted 3'-ends alone serves to reduce the complexity of the
cDNA fragment pool and helps to ensure that each RNA species is represented by 
not more than one restriction product. An adaptor is ligated to facilitate subsequent 
amplification of the captured population. PCR is carried out with one adaptor- 
specific and one biotinylated polydT primer. The reamplified population is 
recaptured and the non-biotinylated strands removed by alkaline dissociation. The 
non-biotinylated strand is then resynthesized using a different adaptor-specific 
primer in the presence of a radiolabeled dNTP. The labelled immobilized 3' cDNA 
ends are next sequentially treated with a series of different restriction endonucleases 
and the products from each digestion analysed by PAGE. The result is a fingerprint 
composed of a number of ladders (equal to the number of sequential digests used).
By comparing test versus control fingerprints, it is possible to identify differentially
expressed products which can then be isolated from the gel and cloned. The
advantages of this procedure are that it is very robust and reproducible, and the
authors estimate that 80-93% of cDNA molecules are involved in the final
fingerprint. The disadvantage is that polyacrylamide gels can rarely resolve more
than 300-400 bands, which compares poorly to the 1000 or more which are
estimated to be produced in an average experiment. The use of 2-D gels such as
those described by Uitterlinden et al. (1989) and Hatada et al. (1991) may help to
overcome this problem. 

A similar method for displaying restriction endonuclease fragments was later
described by Prashar and Weissman (1996). However, instead of sequential
digestion of the immobilized 3'-terminal cDNA fragments, these authors simply
compared the profiles of the control and treated populations without further
manipulation.





[Figure 9 schematic. Steps shown: (1) 1st strand cDNA synthesis using biotinylated polydT primers; (2) cDNA cleaved with AE and captured with streptavidin beads; (3) divide in half and ligate linkers; (4) cleave with tagging enzyme (TE) and produce blunt ends; (5) ligate and amplify; (6) cleave with AE, isolate ditags, concatenate, clone and sequence.]

Figure 9. Serial analysis of gene expression (SAGE) analysis. cDNA is cleaved with an anchoring enzyme
(AE) and the 3' ends captured using streptavidin beads. The cDNA pool is divided in half and each
portion ligated to a different linker, each containing a type IIS restriction site (tagging enzyme,
TE). Restriction with the type IIS enzyme releases the linker plus a short length of cDNA
(XXXXX and OOOOO indicate nucleotides of different tags). The two pools of tags are then
ligated and amplified using linker-specific primers. Following PCR, the products are cleaved with
the AE and the ditags isolated from the linkers using PAGE. The ditags are then ligated (during
which process, concatenation occurs) and cloned into a vector of choice for sequencing. After
Velculescu et al. (1995), with permission.
DNA arrays

'Open' differential display systems are cumbersome in that it takes a great deal
of time to extract and identify candidate genes and then confirm that they are indeed
up- or down-regulated in the treated compared to the control tissue. Normally, the
latter process is carried out using Northern blotting or RT-PCR. Even so, each of
the aforementioned steps produces a bottleneck to the ultimate goal of rapid analysis
of gene expression. These problems will likely be addressed by the development of
so-called DNA arrays (e.g. Gress et al. 1992, Zhao et al. 1993, Schena et al. 1996),
the introduction of which has signalled the next era in differential gene expression
analysis. DNA arrays consist of a gridded membrane or glass 'chip' containing
hundreds or thousands of DNA spots, each consisting of multiple copies of part of
a known gene. The genes are often selected based on previously proven involvement
in oncogenesis, cell cycling, DNA repair, development and other cellular processes.
They are usually chosen to be as specific as possible for each gene and animal species.
Human and mouse arrays are already commercially available and a few companies
will construct a personalized array to order, for example Clontech Laboratories and
Research Genetics Inc. The technique is rapid in that hundreds or even thousands
of genes can be spotted on a single array, and that mRNA/cDNA from the test
populations can be labelled and used directly as probe. When analysed with
appropriate hardware and software, arrays offer a rapid and quantitative means to
assess differences in gene expression between two cell populations. Of course, there
can only be identification and quantitation of those genes which are in the array
(hence the term 'closed' system). Therefore, one approach to elucidating the
molecular mechanisms involved in a particular disease/development system may be
to combine an open and closed system: a DNA array to directly identify and
quantitate the expression of known genes in mRNA populations, and an open
system such as SSH to isolate unknown genes which are differentially expressed.
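
As a rough illustration of the quantitative comparison that a closed system allows, the sketch below flags genes whose signal differs between two populations. It assumes that background-corrected spot intensities are already available as one value per gene; the 2-fold cut-off, the intensity floor, the function name and the gene names in any example data are all hypothetical choices rather than part of any vendor's software.

```python
def fold_changes(control, treated, min_fold=2.0, floor=1.0):
    """Return {gene: treated/control ratio} for genes changed by >= min_fold.

    control, treated: dicts mapping gene name -> spot intensity.
    floor: smallest intensity used, to avoid dividing by zero on empty spots.
    Only genes present on both arrays are compared.
    """
    changed = {}
    for gene in control.keys() & treated.keys():
        c = max(control[gene], floor)
        t = max(treated[gene], floor)
        ratio = t / c
        # keep genes that are up- or down-regulated by at least min_fold
        if ratio >= min_fold or ratio <= 1.0 / min_fold:
            changed[gene] = ratio
    return changed
```

Genes absent from the array never appear in the result, which is the limitation of a closed system noted above.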

One of the main advantages of DNA arrays is the huge number of gene fragments
that one can put on a membrane; some companies have reported gridding up to
60000 spots on a single glass 'chip' (microscope slide). These high density chip-
based micro-arrays will probably become available as mass-produced off-the-shelf
items in the near future. This should facilitate the more rapid determination of
differential expression in time and dose-response experiments. Aside from their
high cost and the technical complexities involved in producing and probing DNA
arrays, the main problem which remains, especially with the newer micro-array
(gene-chip) technologies, is that results are often not wholly reproducible between
arrays. However, this problem is being addressed and should be resolved within the 
next few years. 



EST databases as a means to identify differentially expressed genes

Expressed sequence tags (ESTs) are partial sequences of clones obtained from 
cDNA libraries. Even though most ESTs have no formal identity (putative 
identification is the best to be hoped for), they have proven to be a rapid and efficient 
means of discovering new genes and can be used to generate profiles of gene
expression in specific cells. Since they were first described by Adams et al. (1991)
there has been a huge explosion in EST production and it is estimated that there are
now well over a million such sequences in the public domain, representing over half
of all human genes (Hillier et al. 1996). This large number of freely available
sequences (both sequence information and clones are normally available royalty-free
from the originators) has enabled the development of a new approach towards
differential gene expression analysis as described by Vasmatzis et al. (1998). The
approach is simple in theory: EST databases are first searched for genes that have a 
number of related EST sequences from the target tissue of choice, but none or few 
from non-target tissue libraries. Programmes to assist in the assembly of such sets of 
overlapping data may be developed in-house or obtained privately or from the 
internet. For example, the Institute for Genomic Research (TIGR, found at
http://www.tigr.org) provides many software tools free of charge to the scientific
community. Included amongst these is the TIGR assembler (Sutton et al. 1995), a
tool for the assembly of large sets of overlapping data such as ESTs, bacterial
artificial chromosomes (BACs), or small genomes. Candidate EST clones
representing different genes are then analysed using RNA blot methods for size and tissue
specificity and, if required, used as probes to isolate and identify the full length 
cDNA clone for further characterization. In practice however, the method is rather 
more involved, requiring bioinformatic and computer analysis coupled with 
confirmatory molecular studies. Vasmatzis et al. (1998) have described several
problems in this fledgling approach, such as separating highly homologous 
sequences derived from different genes and an overemphasis of specificity for some 
EST sequences. However, since these problems will largely be addressed by the 
development of more suitable computer algorithms and an increased completeness 
of the EST database, it is likely that this approach to identifying differentially 
expressed genes may enjoy more patronage in the future. 
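
The database search described in this section amounts to a simple filter once ESTs have been assembled into gene clusters. The sketch below assumes that step has already been done and that a table of EST counts per cluster and per tissue library is available; the thresholds, the function name and the data layout are illustrative assumptions, not the procedure of Vasmatzis et al. (1998).

```python
def tissue_enriched_genes(est_counts, target, min_target=3, max_other=1):
    """List clusters with several ESTs in the target tissue and few elsewhere.

    est_counts: {cluster_name: {tissue_library: number_of_ESTs}}
    target: name of the tissue library of interest
    min_target: minimum number of ESTs required from the target tissue
    max_other: maximum total number of ESTs tolerated from all other tissues
    """
    hits = []
    for cluster, by_tissue in est_counts.items():
        in_target = by_tissue.get(target, 0)
        elsewhere = sum(n for tissue, n in by_tissue.items() if tissue != target)
        if in_target >= min_target and elsewhere <= max_other:
            hits.append(cluster)
    return sorted(hits)
```

Candidates returned by such a filter would still need the confirmatory RNA blot analysis described above.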



Problems and potential of differential expression techniques 

The holistic or single cell approach?

When working with in vivo models of differential expression, one of the first 
issues to consider must be the presence of multiple cell types in any given specimen. 
For example, a liver sample is likely to contain not only hepatocytes, but also 
(potentially) Ito cells, bile ductule cells, endothelial cells, various immune cells (e.g. 
lymphocytes, macrophages and Kupffer cells) and fibroblasts. Other tissues will
each have their own distinctive cell populations. Also, in the case of neoplastic tissue,
there are almost always normal, hyperplastic and/or dysplastic cells present in a
sample. One must, therefore, be aware that genes obtained from a differential
display experiment performed on an animal tissue model may not necessarily arise
exclusively from the intended 'target' cells, e.g. hepatocytes/neoplastic cells. If
appropriate, further analyses using immunohistochemistry, in situ hybridization or
in situ RT-PCR should be used to confirm which cell types are expressing the
gene(s) of interest. This problem is probably most acute for those studying the
differential expression of genes in the development of different cell types, where
there is a need to examine homologous cell populations. The problem is now being
addressed at the National Cancer Institute (Bethesda, MD, USA) where new
microdissection techniques have been employed to assist in their gene analysis programme,
the Cancer Genome Anatomy Project (CGAP) (for more information see web site:
http://www.ncbi.nlm.nih.gov/ncicgap/intro.html). There are also separation
techniques available that utilise cell-specific antigens as a means to isolate target cells,
e.g. fluorescence activated cell sorting (FACS) (Dunbar et al. 1998, Kas-Deelen et
al. 1998) and magnetic bead technology (Richard et al. 1998, Rogler et al. 1998).

However, those taking a holistic approach may consider this issue unimportant. 
There is an equally appropriate view that all those genes showing altered expression 
within a compromised tissue should be taken into consideration. After all, since all
tissues are complex mixes of different, interacting cell types which intimately 
regulate each other's growth and development, it is clear that each cell type could in 
some way contribute (positively or negatively) towards the molecular mechanisms 
which lie behind responses to external stimuli or neoplastic growth. It is perhaps 
then more informative to carry out differential display experiments using in vivo as
opposed to in vitro models, where uniform populations of identical cells probably
represent a partial, skewed or even inaccurate picture of the molecular changes that 
occur. 

The incidence and possible implications of inter-individual biological variation 
should be considered in any approach where whole animal models are being used. It 
is clear that individuals (humans and animals) respond in different ways to identical 
stimuli. One of the best characterized examples is the debrisoquine oxidation
polymorphism, which is mediated by cytochrome CYP2D6 and determines the 
pharmacokinetics of many commonly prescribed drugs (Lennard 1993, Meyer and 
Zanger 1997). The reasons for such differences are varied and complex, but allelic 
variations, regulatory region polymorphisms and even physical and mental health 
can all contribute to observed differences in individual responses. Careful thought 
should, therefore, be given to the specific objectives of the study and to the possible
value of pooling starting material (tissue/mRNA). The effect of this can be 
beneficial through the ironing out of exaggerated responses and unimportant minor 
fluctuations of (mechanistically) irrelevant genes in individual animals, thus 
providing a clearer overall picture of the general molecular mechanisms of the 
response. However, at the same time such minor variations may be of utmost 
importance in deciding the ability of individual animals to succumb to or resist the 
effects of a given chemical/disease. 
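
The trade-off involved in pooling can be seen with a few numbers. The sketch below is a purely arithmetical illustration: it assumes an expression level has been measured for one gene in each animal, treats the mean as a stand-in for the pooled sample, and uses an arbitrary 2-fold criterion to flag animals that diverge from it. The function names and threshold are hypothetical.

```python
from statistics import mean


def pooled_level(per_animal):
    """Expression level expected from a pooled sample (simple mean)."""
    return mean(per_animal)


def divergent_animals(per_animal, fold=2.0):
    """Indices of animals whose level differs from the pool by >= fold."""
    pooled = pooled_level(per_animal)
    return [
        i
        for i, level in enumerate(per_animal)
        if level >= fold * pooled or level <= pooled / fold
    ]
```

With levels of 10, 11, 9, 10 and 40, the pooled value is 16: the pool hides the fact that one animal responds quite differently from the other four, which may or may not matter depending on the objectives of the study.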



How efficient are differential expression techniques at recovering a high percentage of
differentially expressed genes?

A number of groups have produced experimental data suggesting that mammalian
cells produce between 8000 and 15000 different mRNA species at any one time
(Mechler and Rabbitts 1981, Hedrick et al. 1984, Bravo 1990), although figures as
high as 20000-30000 have also been quoted (Axel et al. 1976). Hedrick et al. (1984)
provided evidence suggesting that the majority of these belong to the rare abundance 
class. A breakdown of this abundance distribution is shown in table 1. 

When the results of differential display experiments have been compared with
data obtained previously using other methods, it is apparent that not all differentially
expressed mRNAs are represented in the final display. In particular, rare messages
(which, importantly, often encode regulatory proteins) are not easily recovered
using differential display systems. This is a major shortcoming, as the majority of
mRNA species exist at levels of less than 0.005% of the total population (table 1).
Bertioli et al. (1995) examined the efficiency of DD templates (heterogeneous
mRNA populations) for recovering rare messages and were unable to detect mRNA
species present at less than 1.2% of the total mRNA population, equivalent to an
intermediate or abundant species. Interestingly, when simple model systems (single
target only) were used instead of a heterogeneous mRNA population, the same
primers could detect levels of target mRNA down to 10000-fold smaller. These results
are probably best explained by competition for substrates from the many PCR 
products produced in a DD reaction. 

The numbers of differentially expressed mRNAs reported in the literature using
various model systems provide further evidence that many differentially expressed
mRNAs are not recovered. For example, DeRisi et al. (1997) used DNA array 
technology to examine gene expression in yeast following exhaustion of sugar in the 
medium, and found that more than 1700 genes showed a change in expression of at 
least 2-fold. In light of such a finding, it would not be unreasonable to suggest that 
of the 8000-15000 different mRNA species produced by any given mammalian cell,
up to 1000 or more may show altered expression following chemical stimulation. 
Whilst this may be an extreme figure, it is known that at least 100 genes are 
activated/upregulated in Jurkat (T-) cells following IL-2 stimulation (Ullman et al.
1990). In addition, Wan et al. (1996) estimated that interferon-γ-stimulated HeLa
cells differentially express up to 433 genes (assuming 24000 distinct mRNAs
expressed by the cells). However, there have been few publications documenting
anywhere near the recovery of these numbers. For example, in using DD to compare
normal and regenerating mouse liver, Bauer et al. (1993) found only 70 of 38000
total bands to be different. Of these, 50% (35 genes) were shown to correspond to 
differentially expressed bands. Chen et al. (1996) reported 10 genes upregulated in 
female rat liver following ethinyl estradiol treatment. McKenzie and Drake (1997) 
identified 14 different gene products whose expression was altered by phorbol 
myristate acetate (PMA, a tumour promoter agent) stimulation of a human 
myelomonocytic cell line. Kilty and Vickers (1997) identified 10 different gene 
products whose expression was upregulated in the peripheral blood leukocytes of 
allergic disease sufferers. Linskens et al. (1995) found 23 genes differentially 
expressed between young and senescent fibroblasts. Techniques other than DD 
have also provided an apparent paucity of differentially expressed genes. Using SH 
for example, Cao et al. (1997) found 15 genes differentially expressed in colorectal 
cancer compared to normal mucosal epithelium. Fitzpatrick et al. (1995) isolated 17
genes upregulated in rat liver following treatment with the peroxisome proliferator,
clofibrate. Philips et al. (1990) isolated 12 cDNA clones which were upregulated in
highly metastatic mammary adenocarcinoma cell lines compared to poorly
metastatic ones. Prashar and Weissman (1996) used 3' restriction fragment analysis and
identified approximately 40 genes showing altered expression within 4 h of 
activation of Jurkat T-cells. Groenink and Leegwater (1996) analysed 27 gene 
fragments isolated using SSH of delayed early response phase of liver regeneration 
and found only 12 to be upregulated. 

In the laboratory, SSH was used to isolate up to 70 candidate genes which appear 
to show altered expression in guinea pig liver following short-term treatment with 
the peroxisome proliferator, WY-14,643 (Rockett, Swales, Esdaile and Gibson, 
unpublished observations). However, these findings have still to be confirmed by 
analysis of the extracted tissue mRNA for differential expression of these sequences. 
Whilst the latest differential display technologies are purported to include design
and experimental modifications to overcome this lack of efficiency (in both the total
number of differentially expressed genes recovered and the percentage that are true
[...] experiments and animals. DD, on the other hand, is not subject to this problem
since, unlike SH approaches, it does not amplify the difference in expression
between two samples. Wan et al. (1996) reported that differences in expression of 
twofold or more are detectable using DD.



Resolution and visualization of differential expression products

It seems highly improbable with current technology that a gel system could be 
developed that is able to resolve all gene species showing altered expression in any
given test system (be it SH- or DD-based). Polyacrylamide gel electrophoresis
(PAGE) can resolve size differences down to 0.2% (Sambrook et al. 1989) and is
used as standard in DD experiments. Even so, it is clear that a complex series of gene
products such as those seen in a DD will contain unresolvable components. Thus,
what appears to be one band in a gel may in fact turn out to be several. Indeed, it has
been well documented (Mathieu-Daude et al. 1996, Smith et al. 1997) that a single
band extracted from a DD often represents a composite of heterogeneous products,
and the same has been found for SSH displays in this laboratory (Rockett et al.
1997). One possible solution was offered by Mathieu-Daude et al. (1996), who
extracted and reamplified candidate bands from a DD display and used single strand 
conformation polymorphism (SSCP) analysis to confirm which components 
represented the truly differentially expressed product. 
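
To put the resolution figures quoted in this section into base pairs, the short sketch below converts a fractional resolution (0.2% for PAGE, or the 1.5-2% quoted for high resolution agarose) into the smallest size difference a gel could be expected to separate. It is simple arithmetic only; the function names are hypothetical and real gels will not behave as a sharp threshold.

```python
def min_resolvable_difference(fragment_bp, resolution_fraction):
    """Smallest size difference (bp) resolvable for a fragment of this size."""
    return fragment_bp * resolution_fraction


def resolvable(len_a_bp, len_b_bp, resolution_fraction):
    """True if two fragments differ by at least the resolvable amount.

    The threshold is taken relative to the larger of the two fragments.
    """
    threshold = min_resolvable_difference(max(len_a_bp, len_b_bp), resolution_fraction)
    return abs(len_a_bp - len_b_bp) >= threshold
```

For example, two products of 1000 and 1010 bp would be separated at a resolution of 0.2% but not at 1.5%.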

Many scientists often try to avoid the use of PAGE where possible because it is 
technically more demanding than agarose gel electrophoresis (AGE). Unfortunately, 
high resolution agarose gels such as Metaphor (FMC, Lichfield, UK) and AquaPor 
HR (National Diagnostics, Hessle, UK), whilst easier to prepare and manipulate 
than PAGE, can only separate DNA sequences which differ in size by around
1.5-2% (15-20 base pairs for a 1 kb fragment). Thus, SSH, RDA or other such
products which differ in size by less than this amount are normally not resolvable.
However, a simple technique does in fact exist for increasing the resolving power of
AGE: the inclusion of HA-red (10-phenyl neutral red-PEG ligand) or HA-yellow
(bisbenzamide-PEG ligand) (Hanse Analytik GmbH, Bremen, Germany) in a
gel separates identical or closely sized products on base content. Specifically,
HA-red and -yellow selectively bind to GC and AT DNA motifs, respectively
(Wawer et al. 1993, Hanse Analytik 1997, personal communication). Since both
HA-stains possess an overall positive charge, they migrate towards the cathode 
when an electric field is applied. This is in direct opposition to DNA, which 
is negatively charged and, therefore, migrates towards the anode. Thus, if two 
DNA clones are identical in size (as perceived on a standard high resolution 
agarose gel), but differ in AT/GC content, inclusion of a HA-dye in the gel 
will effectively retard the migration of one of the sequences compared to the 
other, effectively making it apparently larger and, thus, providing a means of 
differentiating between the two. The use of HA-red has been shown to resolve 
sequences with an AT variation of less than 1% (Wawer et al. 1995), whilst Hanse
Analytik have reported that HA staining is so sensitive that in one case it was used
to distinguish two 567 bp sequences which differed by only a single point mutation
(Hanse Analytik 1996, personal communication). Therefore, if one wishes to check
whether all the clones produced from a specific band in a differential display
experiment are derived from the same gene species, a small amount of reamplified
or digested clone can be run on a standard high resolution gel, and a second aliquot 
Figure 10. Discrimination of clones of identical/nearly identical size using HA-stain [...]
the sequence, clearly indicates the presence of different gene species within each band.



in a similar gel containing one of the HA-stains. The standard gel should indicate
any gross size differences, whilst the HA-stained gel should separate otherwise
[...] et al. (1997) reported successful use of this approach for identifying DD-derived
clones. Figure 10 shows such an experiment carried out in this laboratory on clones
obtained from a band extracted from an SSH display.
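
The principle behind the HA-stain check, that products of the same size may still differ in base content, can be mimicked in silico once clones have been sequenced. The sketch below groups clones from one band by length and GC count; it is only an analogy to the gel-based separation, and the function names and data layout are hypothetical.

```python
from collections import defaultdict


def gc_count(seq):
    """Number of G and C bases in a DNA sequence."""
    s = seq.upper()
    return s.count("G") + s.count("C")


def split_band(clones):
    """Group clone names by (length, GC count).

    clones: {clone_name: sequence}
    Clones of equal length but different GC count fall into different
    groups, suggesting that the band contained more than one species.
    """
    groups = defaultdict(list)
    for name, seq in clones.items():
        groups[(len(seq), gc_count(seq))].append(name)
    return dict(groups)
```

More than one group for a single band would point to a composite of heterogeneous products; clones that share both size and base content would, as with the gel, remain indistinguishable.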

An alternative approach is to carry out a 2-D analysis of the differential display.
Size-based separation is first carried out in a standard
agarose gel. The gel slice containing the display is then extracted and incorporated
into a HA gel for resolution based on AT/GC content.

Of course, one should always consider the possibility of there being different 
gene species which are the same size and have the same GC/AT content. However,
even these species are not unresolvable given some effort: again, one might use
SSCP, or perhaps a denaturing gradient gel electrophoresis (DGGE) or temperature
gradient gel electrophoresis (TGGE) approach to resolve the contents of a band,
either on the extracted band (Suzuki et al. 1991) or on the reamplified product.

The requirement of some differential display techniques to visualize large 
numbers of products (e.g. DD and GEF) can also present a problem in that, in terms 
of numbers, the resolution of PAGE rarely exceeds 300-400 bands. One approach to
overcoming this might be to use 2-D gels such as those described by Uitterlinden et
al. (1989) and Hatada et al. (1991).
Extraction of differentially expressed bands from a gel can be complex since, in
some cases (e.g. DD, GEF), the results are visualized by autoradiographic means,
such that precise overlay of the developed film on the gel must occur if the correct
band is to be extracted for further analysis. Clearly, a misjudged extraction can
account for many man-hours lost. This problem, and that of the use of radioisotopes,
has been addressed by several groups. For example, Lohmann et al. (1995)
demonstrated that silver staining can be used directly to visualize DD bands in
horizontal PAGs. An et al. (1996) avoided the use of radioisotopes by transferring a
small amount (20-30%) of the DNA from their DD to a nylon membrane, and
visualizing the bands using chemiluminescent staining before going back to extract 
the remaining DNA from the gel. Chen and Peck (1996) went one step further and 
transferred the entire DD to a nylon membrane. The DNA bands were then 
visualized using a digoxigenin (DIG) system (DIG was attached to the polydT 
primers used in the differential display procedure). Differentially expressed bands 
were cut from the membrane and the DNA eluted by washing with PCR buffer prior 
to reamplification. 

One of the advantages of using techniques such as SSH and RDA is that the final 
display can be run on an agarose gel and the bands visualized with simple ethidium 
bromide staining. Whilst this approach can provide acceptable results, overstaining
with SYBR Green I or SYBR Gold nucleic acid stains (FMC) effectively enhances 
the intensity and sharpness of the bands. This greatly aids in their precise extraction 
and often reveals some faint products that may otherwise be overlooked. Whilst 
differential displays stained with SYBR Green I are better visualized using short
wavelength UV (254 nm) rather than medium wavelength (306 nm), the shorter 
wavelength is much more DNA damaging. In practice, it takes only a few seconds 
to damage DNA extracted under 254 nm irradiation, effectively preventing 
reamplification and cloning. The best approach is to overstain with SYBR Green I 
and extract bands under a medium wavelength UV transillumination. 



The possible use of 'micro-fingerprinting' to reduce complexity

Given the sheer number of gene products and the possible complexity of each 
band, an alternative approach to rapid characterization may be to use an enhanced 
analysis of a small section of a differential display: a 'sub-fingerprint' or
'micro-fingerprint'. In this case, one could concentrate on those bands which only appear
in a particular chosen size region. Reducing the fingerprint in this way has at least 
two advantages. One is that it should be possible to use different gel types, 
concentrations and run times tailored exactly to that region. Currently, one might 
run products from 100-3000+ bp on the same gel, which leads to compromise in the 
gel system being used and consequently to suboptimal resolution, both in terms of 
size and numbers, and can lead to problems in the accurate excision of individual 
bands. Secondly, it may be possible to enhance resolution by using a 2-D analysis 
using a HA-stain, as described earlier. In summary, if a range of gene product sizes 
is carefully chosen to include certain 'relevant' genes, the 2-D system standardized, 
and appropriate gene analysis used, it may be possible to develop a method for the 
early and rapid identification of compounds which have similar or widely different 
cellular effects. If the prognosis for exposure to one or more other chemicals which 
display a similar profile is already known, then one could perhaps predict similar 
effects for any new compounds which show a similar micro-fingerprint. 

An alternative approach to micro-fingerprinting is to examine altered expression 
in specific families of genes through careful selection of PCR primers and/or 
post-reaction analysis. Stress genes, growth factors and/or their receptors, cell cycling 
genes, cytochromes P450 and regulatory proteins might be considered as candidates 
for analysis in this way. Indeed, some off-the-shelf DNA arrays (e.g. Clontech's 
Atlas cDNA Expression Array series) already anticipated this to some degree by 
grouping together genes involved in different responses, e.g. apoptosis, stress, 
DNA-damage response etc. 



Screening 

False positives 

The generation of false positives has been discussed at length amongst the 
differential display community (Liang et al. 1993, 1995, Nishio et al. 1994, Sun et al. 
1994, Sompayrac et al. 1995). The reason for false positives varies with the 
technique being used. For instance, in RDA, the use of adaptors which have not 
been HPLC purified can lead to the production of false positives through illegitimate 
ligation events (O'Neill and Sinclair 1997), whilst in DD they can arise through 
artifacts and illegitimate transcription of rRNA. In SH, false positives appear 
to be derived largely from abundant gene species, although some may arise from 
cDNA/mRNA species which do not undergo hybridization for technical reasons. 

A quick screening of putative differentially expressed clones can be carried out 
using a simple dot blot approach, in which labelled first strand probes synthesized 
from tester and driver mRNA are hybridized to an array of said clones (Hedrick et 
al. 1984, Sakaguchi et al. 1986). Differentially expressed clones will hybridize to 
tester probe, but not driver. The disadvantage of this approach is that rare species 
may not generate detectable hybridization signals. One option for those using SSH 
is to screen the clones using a labelled probe generated from the subtracted cDNA 
from which it was derived, and with a probe made from the reverse subtraction 
reaction (ClonTechniques 1997a). Since the SSH method enriches rare sequences, 
it should be possible to confirm the presence of clones representing low abundance 
genes. Despite this quick screening step, there is still the need to go back to the 
original mRNA and confirm the altered expression using a more quantitative 
approach. Although this may be achieved using Northern blots, the sensitivity is 
poor by today's high standards and one must rely on PCR methods for accurate and 
sensitive determinations (see below). 



Sequence analysis 

The majority of differential display procedures produce final products which are 
between 100 and 1000 bp in size. However, this may considerably reduce the size of 
the sequence for analysis of the DNA databases. This in turn leads to a reduced 
confidence in the result: several families of genes have members whose DNA 
sequences are almost identical except in a few key stretches, e.g. the cytochrome 
P450 gene superfamily (Nelson et al. 1996). Thus, does the clone identified as being 
almost identical to gene X really come from that gene, or from a closely related brother 
gene, or from an as yet undiscovered sister gene? For example, using SSH, part of a gene 
was isolated which was up-regulated in the liver of rats exposed to Wy-14,643 and was 
identified by a FASTA search as being transferrin (data not shown). However, transferrin is 
known to be downregulated by hypolipidemic peroxisome proliferators such as 
Wy-14,643 (Hertz et al. 1996), and this was confirmed with subsequent RT-PCR 
analysis. This suggests that the gene sequence isolated may belong to a gene which 
is closely related to transferrin, but is regulated by a different mechanism. 

A further problem associated with SH technology is redundancy. In most cases 
before SH is carried out, the cDNA population must first be simplified by restriction 
digestion. This is important for at least two reasons: 

(1) To reduce complexity— long cDNA fragments may form complex networks 
which prevent the formation of appropriate hybrids, especially at the high 
concentrations required for efficient hybridization. 

(2) Cutting the cDNAs into small fragments provides better representation of 
individual genes. This is because genes derived from related but distinct 
members of gene families often have similar coding sequences that may cross- 
hybridize and be eliminated during the subtraction procedure (Ko 1990). 
Furthermore, different fragments from the same cDNA may differ considerably 
in terms of hybridization and amplification and, thus, may not efficiently do one 
or the other (Wang and Brown 1991). Thus, some fragments from differentially 
expressed cDNAs may be eliminated during subtractive hybridization pro- 
cedures. However, other fragments may be enriched and isolated. As a 
consequence of this, some genes will be cut one or more times, giving rise to two 
or more fragments of different sizes. If those same genes are differentially 
expressed, then two or more of the different size fragments may come through 
as separate bands on the final differential display, increasing the observed 
redundancy and increasing the number of redundant sequencing reactions. 

Sequence comparisons also throw up another important point — at what degree 
of sequence similarity does one accept a result? Is 90% identity between a gene 
derived from your model species and another acceptably close? Is 95% between 
your sequence and one from the same species also acceptable? This problem is 
particularly relevant when the forward and reverse sequence comparisons give 
similar sequences with completely different gene species! An arbitrary decision 
seems to be to allocate genes that are definite (95% and above similarity) and then 
group those between 60 and 95% as being related or possible homologues. 
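
The arbitrary cut-offs just described can be written down as a simple lookup. The Python sketch below is illustrative only: the function name, the label strings and the handling of matches below 60% (which the text does not address) are assumptions, not part of any published protocol.

```python
def classify_match(percent_identity: float) -> str:
    """Bucket a database hit using the arbitrary cut-offs quoted in the text.

    >= 95%        : treated as a definite identification
    60% to < 95%  : treated as related or a possible homologue
    < 60%         : not covered by the text; labelled 'unassigned' here (assumption)
    """
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent identity must lie between 0 and 100")
    if percent_identity >= 95.0:
        return "definite"
    if percent_identity >= 60.0:
        return "related or possible homologue"
    return "unassigned"


if __name__ == "__main__":
    # Hypothetical identities for three clones, for illustration only.
    for identity in (97.0, 82.5, 41.0):
        print(identity, classify_match(identity))
```

Such a rule does not resolve the case raised above, in which forward and reverse reads match different genes; those clones would still need to be checked by hand.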



Quantitative analysis 

At some point, one must give consideration to the quantitative analysis of the 
candidate genes, either as a means of confirming that they are truly differentially 
expressed, or in order to establish just what the differences are. Northern blot 
analysis is a popular approach as it is relatively easy and quick to perform. However, 
the major drawback with Northern blots is that they are often not sensitive enough 
to detect rare sequences. Since the majority of messages expressed in a cell are of low 
abundance (see table 1), this is a major problem. Consequently, RT-PCR may be the 
method of choice for confirming differential expression. Although the procedure is 
somewhat more complex than Northern analysis, requiring synthesis of primers and 
optimization of reaction conditions for each gene species, it is now possible to set up 
high throughput PCR systems using multichannel pipettes, 96+-well plates and 
appropriate thermal cycling technology. Whilst quantitative analysis is more 
desirable, being more accurate and without reliance on an internal standard, the 
money and time needed to develop a competitor molecule is often excessive, 
especially when one might be examining tens or even hundreds of gene species. The 
use of semi-quantitative analysis is simpler, although still relatively involved. One 
must first of all choose an internal standard that does not change in the test cells 
compared to the controls. Numerous reference genes have been tried in the past, for 
example interferon-gamma (IFN-γ, Frye et al. 1989), β-actin (Heuval et al. 1994), 
glyceraldehyde-3-phosphate dehydrogenase (GAPDH, Wong et al. 1994), 
dihydrofolate reductase (DHFR, Mohler and Butler 1991), β2-microglobulin 
(β2-m, Murphy et al. 1990), hypoxanthine phosphoribosyl transferase (HPRT, Foss et 
al. 1998) and a number of others (ClonTechniques 1997b). Ideally, an internal 
standard should not change its level of expression in the cell regardless of cell age, 
stage in the cell cycle or through the effects of external stimuli. However, it has been 
shown on numerous occasions that the levels of most housekeeping genes currently 
used by the research community do in fact change under certain conditions and in 
different tissues (ClonTechniques 1997b). It is imperative, therefore, that 
preliminary experiments be carried out on a panel of housekeeping genes to establish 
their suitability for use in the model system. 
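
One simple way to screen such a panel is to rank the candidate genes by how much their signal varies across control and treated samples. The Python sketch below uses the coefficient of variation for this purpose. That choice is an assumption made here for illustration and is not a method given in the text; the function names and the signal values are likewise hypothetical.

```python
from statistics import mean, pstdev


def rank_reference_genes(signals):
    """Rank candidate internal standards from most to least stable.

    `signals` maps a gene name to its measured signal in each sample
    (controls and treated alike). Stability is scored by the coefficient
    of variation (standard deviation / mean); lower means more stable.
    """
    scores = {}
    for gene, values in signals.items():
        average = mean(values)
        if average <= 0:
            raise ValueError(f"non-positive mean signal for {gene}")
        scores[gene] = pstdev(values) / average
    return sorted(scores.items(), key=lambda item: item[1])


def normalise(target_signal, reference_signal):
    """Express a target gene's signal relative to the chosen internal standard."""
    return target_signal / reference_signal


if __name__ == "__main__":
    # Hypothetical band intensities across four samples, for illustration only.
    panel = {
        "GAPDH": [100.0, 102.0, 98.0, 101.0],
        "beta-actin": [100.0, 150.0, 60.0, 120.0],
    }
    for gene, cv in rank_reference_genes(panel):
        print(f"{gene}: CV = {cv:.3f}")
```

A low coefficient of variation only shows that a gene was stable in the samples tested, so the screen would need repeating for each new tissue or treatment.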

Interpretation of quantitative data must also be treated with caution. By 
comparing the lists of genes identified by differential expression one can perhaps 
gain insight into why two different species react in different ways to external stimuli. 
For example, rats and mice appear sensitive to the non-genotoxic effects of a wide 
range of peroxisome proliferators, whilst Syrian hamsters and guinea pigs are largely 
resistant (Orton et al. 1984, Rodricks and Turnbull 1987, Lake et al. 1989, 1993, 
Makowska et al. 1992). A simplified approach to resolving the reason(s) why is to 
compare lists of up- and down-regulated genes in order to identify those which are 
expressed in only one species and, through background knowledge of the effects of 
the said gene, might suggest a mechanism of facilitated non-genotoxic carcinogenesis 
or protection. Of course, the situation is likely to be far more complex. Perhaps if 
there were one key gene protecting guinea pig from non-genotoxic effects and it was 
upregulated ,0 times by PPs, the same gene might only be up-regulated five times 
in the rat. However, since both were noted to be upregulated, the importance of the 
gene may be overlooked. Just to complicate matters, a large change in expression 
does not necessarily mean a biologically important change. For example, what is the 
true relevance of gene Y which shows a 50-fold increase after a particular treatment 
and gene Z which shows only a 5-fold increase? If one examines the literature one 
may find that historically, gene Y has often been shown to be up-regulated 40-«0-fold 
by a number of unrelated stimuli; in light of this the 50-fold increase would 
appear less significant. However, the literature may show that gene Z has never been 
recorded as having more than doubled in expression, which makes your 5-fold 
increase all the more exciting. Perhaps even more interesting is if that same 5-fold 
increase has only been seen in related neoplasms or following treatment with related 
chemicals. 

Problems in using the differential display approach 

Differential display technology originally held promise of an easily obtainable 
fingerprint of those genes which are up- or down-regulated in test animals/cells in 
a developmental process or following exposure to given stimuli. However, it has 
become clear that the fingerprinting process, whilst still valid, is much too complex 
to be represented by a single technique profile. This is because all differential display 
techniques have common and/or unique technical problems which preclude the 
isolation and identification of all those genes which show changes in expression. 
Furthermore, there are important genetic changes related to disease development 
which differential expression analysis is simply not designed to address. An example 
of this is the presence of small deletions, insertions, or point mutations such as those 
seen in activated oncogenes, tumour suppressor genes and individual 
polymorphisms. Polymorphic variations, small though they usually are, are often 
regarded as being of paramount importance in explaining why some patients 
respond better than others to certain drug treatments (and, in logical extension, why 
some people are less affected by potentially dangerous xenobiotics/carcinogens than 
others). The identification of such point mutations and naturally occurring 
polymorphisms requires the subsequent application of sequencing, SSCP, DGGE 
or TGGE to the gene of interest. Furthermore, differential display is not designed 
to address issues such as alternatively spliced gene species or whether an increased 
abundance of mRNA is a result of increased transcription or increased mRNA 
stability. 



Conclusions 

Perhaps the main advantage of open system differential display techniques is that 
they are not limited by extant theories or researcher bias in revealing genes which are 
differentially expressed, since they are designed to amplify all genes which 
demonstrate altered expression. This means that they are useful for the isolation of 
previously unknown genes which may turn out to be useful biomarkers of a particular 
state or condition. At least one open system (SAGE) is also quantitative, thus 
eliminating the need to return to the original mRNA and carry out Northern/PCR 
analysis to confirm the result. However, the rapid progress of genome mapping 
projects means that over the next 5-10 years or so, the balance of experimental use 
will switch from open to closed differential display systems, particularly DNA 
arrays. Arrays are easier and faster to prepare and use, provide quantitative data, are 
suitable for high throughput analysis and can be tailored to look at specific signalling 
pathways or families of genes. Identification of all the gene sequences in human and 
common laboratory animals combined with improved DNA array technology, 
means that it will soon no longer be necessary to try to isolate differentially expressed 
genes using the technically more demanding open system approach. Thus, their 
main advantage (that of identifying unknown genes) will be largely eradicated. It is 
likely, therefore, that their sphere of application will be reduced to analysis of the 
less common laboratory species, since it will be some time yet before the genomes of 
such animals as zebrafish, electric eels, gerbils, crayfish and squid, for example, will 
be sequenced. 

Of course, in the end the question will always remain: What is the functional/ 
biological significance of the identified, differentially expressed genes? One 
persistent problem is understanding whether differentially expressed genes are a 
cause or consequence of the altered state. Furthermore, many chemicals, such as 
non-genotoxic carcinogens, are also mitogens and so genes associated with 
replication will also be upregulated but may have little or nothing to do with the 
carcinogenic effect. Whilst differential display technology cannot hope to answer 
these questions, it does provide a springboard from which identification, regulatory 
and functional studies can be launched. Understanding the molecular mechanism of 
cellular responses is almost impossible without knowing the regulation and function 
[...] condition ([...] mutated). In the abstract, differential 
display can be likened to a still photograph, showing details of a fixed moment in 
time. Consider the historian who knows the outcome of a battle and the placement 
and condition of the troops before the battle commenced, but is asked to try and 
deduce how the battle progressed and why it ended as it did from a few still 
photographs: an impossible task. In order to understand the battle, the historian 
must find out the capability and motivation of the soldiers and their commanding 
officers, what the orders were and whether they were obeyed. He must examine the 
terrain, the remains of the battle and consider the effects the prevailing weather 
conditions exerted. Likewise, if mechanistic answers are to be forthcoming, the 
scientist must use differential display in combination with other techniques, such as 
[...] of cell signalling pathways, mutation analysis and 
time and dose response analyses. Although this review has emphasized the 
[...] the full impact of this approach will be strengthened if used in combination with 
functional genomics and proteomics (2-dimensional protein gels from isoelectric 
focusing and subsequent SDS electrophoresis and virtual 2D-maps using capillary 
electrophoresis). Proteomics is attracting much recent attention as many of the 
changes resulting in differential gene expression do not involve changes in mRNA 
levels, as described extensively herein, but rather protein-protein, protein-DNA and 
protein phosphorylation events which would require functional genomics or 
proteomic technologies for investigation. 

Despite the limitations of differential display technology, it is clear that many 
potential applications and benefits can be obtained from characterizing the genetic 
changes that occur in a cell during normal and disease development and in response 
to [...] or biological [...]. In light of functional data, such profiling [...] 
of each stage of development/response, and in the long 
term should help in the elucidation of specific and sensitive biomarkers for different 
types of chemical/biological exposure and disease states. The potential medical and 
therapeutic benefits of understanding such molecular changes are almost 
immeasurable. Amongst other things, such fingerprints could indicate the family or 
even specific type of chemical an individual has been exposed to plus the length 
and/or acuteness of that exposure, thus indicating the most prudent treatment. 
They may also help uncover differences in histologically identical cancers, provide 
diagnostic tests for the earliest stages of neoplasia and, again, perhaps indicate the 
most efficacious treatment. 

The Human Genome Project will be completed early in the next century and the 
DNA sequence of all the human genes will be known. The continuing development 
and evolution of differential gene expression technology will ensure that this 
knowledge contributes fully to the understanding of human disease processes. 



ABSTRACT Pairwise sequence comparison methods have 
been assessed using proteins whose relationships are known 
reliably from their structures and functions, as described in 
the SCOP database [Murzin, A. G., Brenner, S. E., Hubbard, T. 
& Chothia C. (1995) J. Mol. Biol. 247, 536-540]. The evalua- 
tion tested the programs BLAST [Altschul, S. F., Gish, W., 
Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol. 
215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996) 
Methods Enzymol. 266, 460-480], FASTA [Pearson, W. R. & 
Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448], 
and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol. 
Biol. 147, 195-197] and their scoring schemes. The error rate 
of all algorithms is greatly reduced by using statistical scores 
to evaluate matches rather than percentage identity or raw 
scores. The E-value statistical scores of SSEARCH and FASTA are 
reliable: the number of false positives found in our tests agrees 
well with the scores reported. However, the P-values reported 
by BLAST and WU-BLAST2 exaggerate significance by orders of 
magnitude. SSEARCH, FASTA ktup = 1, and WU-BLAST2 perform 
best, and they are capable of detecting almost all relationships 
between proteins whose sequence identities are >30%. For 
more distantly related proteins, they do much less well; only 
one-half of the relationships between proteins with 20-30% 
identity are found. Because many homologs have low sequence 
similarity, most distant relationships cannot be detected by 
any pairwise comparison method; however, those which are 
identified may be used with confidence. 



Sequence database searching plays a role in virtually every 
branch of molecular biology and is crucial for interpreting the 
sequences issuing forth from genome projects. Given the 
method's central role, it is surprising that overall and relative 
capabilities of different procedures are largely unknown. It is 
difficult to verify algorithms on sample data because this 
requires large data sets of proteins whose evolutionary rela- 
tionships are known unambiguously and independently of the 
methods being evaluated. However, nearly all known ho- 
mologs have been identified by sequence analysis (the method 
to be tested). Also, it is generally very difficult to know, in the 
absence of structural data, whether two proteins that lack clear 
sequence similarity are unrelated. This has meant that al- 
though previous evaluations have helped improve sequence 
comparison, they have suffered from insufficient, imperfectly 
characterized, or artificial test data. Assessment also has been 
problematic because high quality database sequence searching 
attempts to have both sensitivity (detection of homologs) and 
specificity (rejection of unrelated proteins); however, these 
complementary goals are linked such that increasing one 
causes the other to be reduced. 

Sequence comparison methodologies have evolved rapidly, 
so no previously published test has evaluated modern versions 
of programs commonly used. For example, parameters in 
BLAST (1) have changed, and WU-BLAST2 (2), which produces 
gapped alignments, has become available. The latest version 
of FASTA (3) previously tested was 1.6, but the current release 
(version 3.0) provides fundamentally different results in the 
form of statistical scoring. 

The previous reports also have left gaps in our knowledge. 
For example, there has been no published assessment of 
thresholds for scoring schemes more sophisticated than per- 
centage identity. Thus, the widely discussed statistical scoring 
measures have never actually been evaluated on large data- 
bases of real proteins. Moreover, the different scoring schemes 
commonly in use have not been compared. 

Beyond these issues, there is a more fundamental question: 
in an absolute sense, how well does pairwise sequence com- 
parison work? That is, what fraction of homologous proteins 
can be detected using modern database searching methods? 

In this work, we attempt to answer these questions and to 
overcome both of the fundamental difficulties that have hin- 
dered assessment of sequence comparison methodologies. 
First, we use the set of distant evolutionary relationships in the 
SCOP: Structural Classification of Proteins database (4), which 
is derived from structural and functional characteristics (5). 
The SCOP database provides a uniquely reliable set of ho- 
mologs, which are known independently of sequence compar- 
ison. Second, we use an assessment method that jointly mea- 
sures both sensitivity and specificity. This method allows 
straightforward comparison of different sequence searching 
procedures. Further, it can be used to aid interpretation of real 
database searches and thus provide optimal and reliable 
results. 

Previous Assessments of Sequence Comparison. Several 
previous studies have examined the relative performance of 
different sequence comparison methods. The most encom- 
passing analyses have been by Pearson (6, 7), who compared 
the three most commonly used programs. Of these, the Smith- 
Waterman algorithm (8) implemented in SSEARCH (3) is the 
oldest and slowest but the most rigorous. Modern heuristics 
have provided BLAST (1) the speed and convenience to make 
it the most popular program. Intermediate between these two 
is FASTA (3), which may be run in two modes offering either 
greater speed (ktup = 2) or greater effectiveness (ktup = 1). 
Pearson also considered different parameters for each of these 
programs. 

To test the methods, Pearson selected two representative 
proteins from each of 67 protein superfamilies defined by the 
PIR database (9). Each was used as a query to search the 
database, and the matched proteins were marked as being 
homologous or unrelated according to their membership of PIR 
superfamilies. Pearson found that modern matrices and "ln-scaling" 
of raw scores improve results considerably. He also 
reported that the rigorous Smith-Waterman algorithm worked 
slightly better than FASTA, which was in turn more effective 
than BLAST. 

Very large scale analyses of matrices have been performed 
(10), and Henikoff and Henikoff (11) also evaluated the 
effectiveness of BLAST and FASTA. Their test with BLAST 
considered the ability to detect homologs above a predetermined 
score but had no penalty for methods which also 
reported large numbers of spurious matches. The Henikoffs 
searched the SWISS-PROT database (12) and used PROSITE (13) 
to define homologous families. Their results showed that the 
BLOSUM62 matrix (14) performed markedly better than the 
extrapolated PAM-series matrices (15), which previously had 
been popular. 

A crucial aspect of any assessment is the data that are used 
to test the ability of the program to find homologs. But in 
Pearson's and the Henikoffs' evaluations of sequence comparison, 
the correct results were effectively unknown. This is 
because the superfamilies in PIR and PROSITE are principally 
created by using the same sequence comparison methods 
which are being evaluated. Interdependency of data and 
methods creates a "chicken and egg" problem, and means for 
example, that new methods would be penalized for correctly 
identifying homologs missed by older programs. For instance, 
immunoglobulin variable and constant domains are clearly 
homologous, but PIR places them in different superfamilies. 
The problem is widespread: each superfamily in PIR 48.00 with 
a structural homolog is itself homologous to an average of 1.6 
other PIR superfamilies (16). 

To surmount these sorts of difficulties, Sander and Schnei- 
der (17) used protein structures to evaluate sequence com- 
parison. Rather than comparing different sequence compari- 
son algorithms, their work focused on determining a length- 
dependent threshold of percentage identity, above which all 
proteins would be of similar structure. A result of this analysis 
was the HSSP equation; it states that proteins with 25% identity 
over 80 residues will have similar structures, whereas shorter 
alignments require higher identity. (Other studies also have 
used structures (18-20), but these focused on a small number 
of model proteins and were principally oriented toward eval- 
uating alignment accuracy rather than homology detection.) 

A general solution to the problem of scoring comes from 
statistical measures (i.e., E-values and P-values) based on the 
extreme value distribution (21). Extreme value scoring was 
implemented analytically in the BLAST program using the 
Karlin and Altschul statistics (22, 23) and empirical approaches 
have been recently added to FASTA and SSEARCH. In 
addition to being heralded as a reliable means of recognizing 
significantly similar proteins (24, 25), the mathematical trac- 
tability of statistical scores "is a crucial feature of the BLAST 
algorithm" (1). The validity of this scoring procedure has been 
tested analytically and empirically (see ref. 2 and references in 
ref. 24). However, all large empirical tests used random 
sequences that may lack the subtle structure found within 
biological sequences (26, 27) and obviously do not contain any 
real homologs. Thus, although many researchers have sug- 
gested that statistical scores be used to rank matches (24, 25, 
28), there have been no large rigorous experiments on biolog- 
ical data to determine the degree to which such rankings are 
superior. 

A Database for Testing Homology Detection. Since the 
discovery that the structures of hemoglobin and myoglobin are 
very similar though their sequences are not (29), it has been 
apparent that comparing structures is a more powerful (if less 
convenient) way to recognize distant evolutionary relation- 
ships than comparing sequences. If two proteins show a high 
degree of similarity in their structural details and function, it 
is very probable that they have an evolutionary relationship 
though their sequence similarity may be low. 

The recent growth of protein structure information com- 
bined with the comprehensive evolutionary classification in 
the SCOP database (4, 5) have allowed us to overcome previous 
limitations. With these data, we can evaluate the performance 
of sequence comparison methods on real protein sequences 
whose relationships are known confidently. The SCOP database 
uses structural information to recognize distant homologs, the 
large majority of which can be determined unambiguously. 
These superfamilies, such as the globins or the immunoglobu- 
lins, would be recognized as related by the vast majority of the 
biological community despite the lack of high sequence sim- 
ilarity. 

From SCOP, we extracted the sequences of domains of 
proteins in the Protein Data Bank (PDB) (30) and created two 
databases. One (PDB90D-B) has domains, which were all <90% 
identical to any other, whereas (PDB40D-B) had those <40% 
identical. The databases were created by first sorting all 
protein domains in SCOP by their quality and making a list. The 
highest quality domain was selected for inclusion in the 
database and removed from the list. Also removed from the list 
(and discarded) were all other domains above the threshold 
level of identity to the selected domain. This process was 
repeated until the list was empty. The PDB40D-B database 
contains 1,323 domains, which have 9,044 ordered pairs of 
distant relationships, or ≈0.5% of the total 1,749,006 ordered 
pairs. In PDB90D-B, the 2,079 domains have 53,988 relation- 
ships, representing 1.2% of all pairs. Low complexity regions 
of sequence can achieve spurious high scores, so these were 
masked in both databases by processing with the SEG program 
(27) using recommended parameters: 12 1.8 2.0. The databases 
used in this paper are available from http://sss.stanford.edu/sss/, 
and databases derived from the current version of SCOP 
may be found at http://scop.mrc-lmb.cam.ac.uk/scop/. 
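
The greedy selection procedure described above, and the arithmetic behind the quoted pair counts, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the `quality` and `identity` callables and the toy domains are all hypothetical, and domains are kept only when strictly below the threshold, following the "<90%" and "<40%" wording.

```python
def select_nonredundant(domains, quality, identity, threshold):
    """Greedily pick a set of domains that are pairwise below `threshold` identity.

    domains   : iterable of domain identifiers
    quality   : callable giving a quality score for a domain (higher is better)
    identity  : callable giving percentage identity between two domains
    threshold : identity level (e.g. 40.0 or 90.0) at or above which a domain
                is discarded once a better-quality relative has been selected
    """
    remaining = sorted(domains, key=quality, reverse=True)
    selected = []
    while remaining:
        best = remaining[0]  # highest-quality domain still on the list
        selected.append(best)
        # Drop the selected domain and everything too similar to it.
        remaining = [d for d in remaining[1:] if identity(best, d) < threshold]
    return selected


def ordered_pairs(n):
    """Number of ordered (query, target) pairs among n distinct sequences."""
    return n * (n - 1)


if __name__ == "__main__":
    # Toy data, for illustration only.
    quality = {"a": 3, "b": 2, "c": 1}
    pair_identity = {
        frozenset(("a", "b")): 95.0,
        frozenset(("a", "c")): 30.0,
        frozenset(("b", "c")): 30.0,
    }
    print(select_nonredundant(
        quality, quality.get,
        lambda x, y: pair_identity[frozenset((x, y))], 40.0))

    # Pair counts quoted in the text.
    print(ordered_pairs(1323), 100 * 9044 / ordered_pairs(1323))
    print(ordered_pairs(2079), 100 * 53988 / ordered_pairs(2079))
```

With 1,323 domains the ordered-pair count is 1,323 × 1,322 = 1,749,006, matching the figure quoted above, and 9,044 of those pairs is roughly 0.5%.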

Analyses from both databases were generally consistent, but 
PDB40D-B focuses on distantly related proteins and reduces the 
heavy overrepresentation in the PDB of a small number of 
families (31, 32), whereas PDB90D-B (with more sequences) 
improves evaluations of statistics. Except where noted other- 
wise, the distant homolog results here are from PDB40D-B. 
Although the precise numbers reported here are specific to the 
structural domain databases used, we expect the trends to be 
general. 

Assessment Data and Procedure. Our assessment of se- 
quence comparison may be divided into four different major 
categories of tests. First, using just a single sequence compar- 
ison algorithm at a time, we evaluated the effectiveness of 
different scoring schemes. Second, we assessed the reliability 
of scoring procedures, including an evaluation of the validity 
of statistical scoring. Third, we compared sequence compari- 
son algorithms (using the optimal scoring scheme) to deter- 
mine their relative performance. Fourth, we examined the 
distribution of homologs and considered the power of pairwise 
sequence comparison to recognize them. All of the analyses 
used the databases of structurally identified homologs and a 
new assessment criterion. 

The analyses tested BLAST (1), version 1.4.9MP, and
WU-BLAST2 (2), version 2.0a13MP. Also assessed was the FASTA
package, version 3.0t76 (3), which provided FASTA and the
SSEARCH implementation of Smith-Waterman (8). For
SSEARCH and FASTA, we used BLOSUM45 with gap penalties
-12/-1 (7, 16). The default parameters and matrix
(BLOSUM62) were used for BLAST and WU-BLAST2.

The "Coverage vs. Error" Plot. To test a particular protocol
(comprising a program and scoring scheme), each sequence 
from the database was used as a query to search the database. 
This yielded ordered pairs of query and target sequences with 
associated scores, which were sorted, on the basis of their 
scores, from best to worst. The ideal method would have 
Fig. 1. Coverage vs. error plots of different scoring schemes for SSEARCH Smith-Waterman. (A) Analysis of PDB40D-B database. (B) Analysis
of PDB90D-B database. All of the proteins in the database were compared with each other using the SSEARCH program. The results of this single
set of comparisons were considered using five different scoring schemes and assessed. The graphs show the coverage and errors per query (EPQ)
for statistical scores, raw scores, and three measures using percentage identity. In the coverage vs. error plot, the x axis indicates the fraction of
all homologs in the database (known from structure) which have been detected. Precisely, it is the number of detected pairs of proteins with the
same fold divided by the total number of pairs from a common superfamily. PDB40D-B contains a total of 9,044 homologs, so a score of 10% indicates
identification of 904 relationships. The y axis reports the number of EPQ. Because there are 1,323 queries made in the PDB40D-B all-vs.-all
comparison, 13 errors corresponds to 0.01, or 1% EPQ. The y axis is presented on a log scale to show results over the widely varying degrees of
accuracy which may be desired. The scores that correspond to the levels of EPQ and coverage are shown in Fig. 4 and Table 1. The graph
demonstrates the trade-off between sensitivity and selectivity. As more homologs are found (moving to the right), more errors are made (moving
up). The ideal method would be in the lower right corner of the graph, which corresponds to identifying many evolutionary relationships without
selecting unrelated proteins. Three measures of percentage identity are plotted. Percentage identity within alignment is the degree of identity within
the aligned region of the proteins, without consideration of the alignment length. Percentage identity within both is the number of identical residues
in the aligned region as a percentage of the average length of the query and target proteins. The HSSP equation (17) is $H = 290.15\,l^{-0.562}$, where
$l$ is length, for $10 < l < 80$; $H > 100$ for $l < 10$; $H = 24.7$ for $l > 80$. The percentage identity HSSP-adjusted score is the percent identity within
the alignment minus H. Smith-Waterman raw scores and E-values were taken directly from the sequence comparison program.



perfect separation, with all of the homologs at the top of the 
list and unrelated proteins below. In practice, perfect separa- 
tion is impossible to achieve so instead one is interested in 
drawing a threshold above which there are the largest number 
of related pairs of sequences consistent with an acceptable 
error rate. 

Our procedure involved measuring the coverage and error 
for every threshold. Coverage was defined as the fraction of 
structurally determined homologs that have scores above the 
selected threshold; this reflects the sensitivity of a method. 
Errors per query (EPQ), an indicator of selectivity, is the 
number of nonhomologous pairs above the threshold divided 
by the number of queries. Graphs of these data, called 
coverage vs. error plots, were devised to understand how
protocols compare at different levels of accuracy. These
graphs share effectively all of the beneficial features of
Receiver Operating Characteristic (ROC) plots (33, 34) but
better represent the high degrees of accuracy required in 
sequence comparison and the huge background of nonho- 
mologs. 
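
The coverage and EPQ definitions above translate directly into a short calculation. The following is a minimal sketch, assuming that higher scores are better (E-values would be negated first) and ignoring tied scores; the function names and the five toy hits are hypothetical.

```python
from typing import List, Tuple


def coverage_vs_error(
    hits: List[Tuple[float, bool]],
    n_homolog_pairs: int,
    n_queries: int,
) -> List[Tuple[float, float]]:
    """Return (coverage, EPQ) after each hit, scanning from best to worst.

    `hits` holds (score, is_homolog) for every reported query-target pair.
    """
    ordered = sorted(hits, key=lambda hit: hit[0], reverse=True)
    found = 0    # structurally confirmed homologs above the threshold
    errors = 0   # nonhomologous pairs above the threshold
    curve = []
    for _score, is_homolog in ordered:
        if is_homolog:
            found += 1
        else:
            errors += 1
        curve.append((found / n_homolog_pairs, errors / n_queries))
    return curve


def coverage_at_epq(curve: List[Tuple[float, float]], max_epq: float) -> float:
    """Largest coverage reachable without exceeding `max_epq` errors per query."""
    return max((cov for cov, epq in curve if epq <= max_epq), default=0.0)


# Toy example: 2 queries, 4 true homolog pairs in the database.
toy_hits = [(50, True), (40, True), (30, False), (20, True), (10, False)]
toy_curve = coverage_vs_error(toy_hits, n_homolog_pairs=4, n_queries=2)
print(coverage_at_epq(toy_curve, 0.0))  # 0.5
print(coverage_at_epq(toy_curve, 0.5))  # 0.75
```

Plotting the first element of each pair on the x axis and the second on a logarithmic y axis would give a coverage vs. error plot of the kind described.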

This assessment procedure is directly relevant to practical 
sequence database searching, for it provides precisely the 
information necessary to perform a reliable sequence database 
search. The EPQ measure places a premium on score consis- 
tency; that is, it requires scores to be comparable for different 
queries. Consistency is an aspect which has been largely 
Fig. 2. Unrelated proteins with high percentage identity.
Hemoglobin β-chain (PDB code 1hds chain b, ref. 38, Left) and cellulase E2
(PDB code 1tml, ref. 39, Right) have 39% identity over 64 residues, a
level which is often believed to be indicative of homology. Despite this
high degree of identity, their structures strongly suggest that these
proteins are not related. Appropriately, neither the raw alignment
score of 85 nor the E-value of 1.3 is significant. Proteins rendered by
RASMOL (40).
Fig. 3. Length and percentage identity of alignments of unrelated
proteins in PDB90D-B: Each pair of nonhomologous proteins found with
SSEARCH is plotted as a point whose position indicates the length and
the percentage identity within the alignment. Because alignment
length and percentage identity are quantized, many pairs of proteins
may have exactly the same alignment length and percentage identity.
The line shows the HSSP threshold (though it is intended to be applied
with a different matrix and parameters).
Fig. 4. Reliability of statistical scores in PDB90D-B: Each line shows
the relationship between reported statistical score and actual error
rate for a different program. E-values are reported for SSEARCH and
FASTA, whereas P-values are shown for BLAST and WU-BLAST2. If the
scoring were perfect, then the number of errors per query and the
E-values would be the same, as indicated by the upper bold line.
(P-values should be the same as EPQ for small numbers and diverge
at higher values, as indicated by the lower bold line.) E-values from
SSEARCH and FASTA are shown to have good agreement with EPQ but
underestimate the significance slightly. BLAST and WU-BLAST2 are
overconfident, with the degree of exaggeration dependent upon the
score. The results for PDB40D-B were similar to those for PDB90D-B
despite the difference in number of homologs detected. This graph
could be used to roughly calibrate the reliability of a given statistical
score.

ignored in previous tests but is essential for the straightforward 
or automatic interpretation of sequence comparison results. 
Further, it provides a clear indication of the confidence that 
should be ascribed to each match. Indeed, the EPQ measure 
should approximate the expectation value reported by data- 
base searching programs, if the programs' estimates are accu- 
rate. 
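
To compare EPQ with programs that report P-values rather than E-values, the two are commonly related by assuming that chance hits follow a Poisson distribution, so that P = 1 - exp(-E). That relation is standard background and is an assumption here, not something stated in the text; the function names below are hypothetical.

```python
import math


def p_from_e(e_value: float) -> float:
    """Probability of at least one chance hit, given the expected number E."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


def observed_epq(n_errors: int, n_queries: int) -> float:
    """Errors per query actually observed at some score threshold."""
    return n_errors / n_queries


for e in (0.001, 0.01, 1.0, 10.0):
    # For small E the two values nearly coincide; for large E, P saturates at 1.
    print(e, p_from_e(e))

print(round(observed_epq(13, 1323), 2))  # 0.01
```

Under this assumption a well-calibrated program would report E-values close to the observed EPQ at every threshold, while its P-values would match EPQ only when both are small.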

The Performance of Scoring Schemes. All of the programs 
tested could provide three fundamental types of scores. The 
first score is the percentage identity, which may be computed 
in several ways based on either the length of the alignment or 
the lengths of the sequences. The second is a "raw" or 
"Smith-Waterman" score, which is the measure optimized by 
the Smith-Waterman algorithm and is computed by summing 
the substitution matrix scores for each position in the align- 
ment and subtracting gap penalties. In BLAST, a measure 
related to this score is scaled into bits. Third is a statistical
score based on the extreme value distribution. These results 
are summarized in Fig. 1. 
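
The raw score described above (summed substitution scores minus gap penalties) can be illustrated on an already-aligned pair of sequences. This is a sketch under assumptions: the toy substitution function is hypothetical and is not BLOSUM, and the gap convention (12 for the first gapped position, 1 for each further one) is one reading of the -12/-1 penalties; conventions differ between programs.

```python
from typing import Callable

GAP = "-"


def raw_score(
    query_aln: str,
    target_aln: str,
    substitution: Callable[[str, str], int],
    gap_first: int = 12,
    gap_next: int = 1,
) -> int:
    """Sum substitution scores over aligned columns and subtract gap penalties."""
    if len(query_aln) != len(target_aln):
        raise ValueError("aligned strings must have equal length")
    score = 0
    in_gap = False
    for q, t in zip(query_aln, target_aln):
        if q == GAP or t == GAP:
            # Simplification: adjacent gaps in either sequence count as one gap.
            score -= gap_next if in_gap else gap_first
            in_gap = True
        else:
            score += substitution(q, t)
            in_gap = False
    return score


def toy_substitution(a: str, b: str) -> int:
    """Hypothetical scores: +5 for identical residues, -2 otherwise."""
    return 5 if a == b else -2


print(raw_score("ACD-EF", "ACDGEW", toy_substitution))  # 6
print(raw_score("AC--EF", "ACGGEF", toy_substitution))  # 7
```

Because the total depends on the matrix, the gap parameters, and the alignment length, a fixed raw-score cutoff does not carry over between settings.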

Sequence Identity. Though it has been long established that 
percentage identity is a poor measure (35), there is a common 
rule-of-thumb stating that 30% identity signifies homology. 
Moreover, publications have indicated that 25% identity can 
be used as a threshold (17, 36). We find that these thresholds, 
originally derived years ago, are not supported by present 
results. As databases have grown, so have the possibilities for 
chance alignments with high identity; thus, the reported cutoffs 
lead to frequent errors. Fig. 2 shows one of the many pairs of 
proteins with very different structures that nonetheless have 
high levels of identity over considerable aligned regions. 
Despite the high identity, the raw and the statistical scores for 
such incorrect matches are typically not significant. The prin- 
cipal reasons percentage identity does so poorly seem to be 
that it ignores information about gaps and about the conser- 
vative or radical nature of residue substitutions. 

From the PDB90D-B analysis in Fig. 3, we learn that 30% 
identity is a reliable threshold for this database only for 
sequence alignments of at least 150 residues. Because one 
unrelated pair of proteins has 43.5% identity over 62 residues, 
it is probably necessary for alignments to be at least 70 residues 
in length before 40% is a reasonable threshold, for a database 
of this particular size and composition. 

At a given reliability, scores based on percentage identity 
detect just a fraction of the distant homologs found by 
statistical scoring. If one measures the percentage identity in 
the aligned regions without consideration of alignment length, 
then a negligible number of distant homologs are detected. 
Use of the HSSP equation improves the value of percentage
identity, but even this measure can find only 4% of all known 
homologs at 1% EPQ. In short, percentage identity discards 
most of the information measured in a sequence comparison. 
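
The three percentage-identity measures compared here can be written out explicitly. The sketch below follows the definitions given in the Fig. 1 legend; the function names are hypothetical, gap columns are assumed to count toward alignment length, and the HSSP formula is assumed to apply at exactly 10 and 80 residues, where the legend leaves it unspecified.

```python
import math


def count_identical(query_aln: str, target_aln: str) -> int:
    """Number of aligned columns holding the same residue (gaps never match)."""
    return sum(1 for q, t in zip(query_aln, target_aln) if q == t and q != "-")


def identity_within_alignment(query_aln: str, target_aln: str) -> float:
    """Percent identity over the aligned region only."""
    return 100.0 * count_identical(query_aln, target_aln) / len(query_aln)


def identity_within_both(
    query_aln: str, target_aln: str, query_len: int, target_len: int
) -> float:
    """Identical residues as a percentage of the mean full sequence length."""
    mean_len = (query_len + target_len) / 2
    return 100.0 * count_identical(query_aln, target_aln) / mean_len


def hssp_threshold(length: int) -> float:
    """HSSP curve H for a given alignment length."""
    if length < 10:
        return math.inf  # the legend gives H > 100: no identity level passes
    if length > 80:
        return 24.7
    return 290.15 * length ** -0.562


def hssp_adjusted(percent_identity: float, length: int) -> float:
    """Percent identity within the alignment minus H."""
    return percent_identity - hssp_threshold(length)


# The Fig. 2 pair: 39% identity over 64 residues.
print(round(hssp_threshold(64), 1))
print(hssp_adjusted(39.0, 64) > 0)  # True
```

Under these assumptions the unrelated pair in Fig. 2 still lies above the HSSP curve, which is consistent with the limited reliability reported for identity-based thresholds.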

Raw Scores. Smith-Waterman raw scores perform better 
than percentage identity (Fig. 1), but ln-scaling (7) provided no
notable benefit in our analysis. It is necessary to be very precise 
when using either raw or bit scores because a 20% change in 
cutoff score could yield a tenfold difference in EPQ. However, 
it is difficult to choose appropriate thresholds because the 
reliability of a bit score depends on the lengths of the proteins 
matched and the size of the database. Raw score thresholds 
also are affected by matrix and gap parameters. 

Statistical Scores. Statistical scores were introduced partly 
to overcome the problems that arise from raw scores. This 
scoring scheme provides the best discrimination between 
homologous proteins and those which are unrelated. Most 
Fig. 5. Coverage vs. error plots of different sequence comparison methods: Five different sequence comparison methods are evaluated, each
using statistical scores (E- or P-values). (A) PDB40D-B database. In this analysis, the best method is the slow SSEARCH, which finds 18% of relationships
at 1% EPQ. FASTA ktup = 1 and WU-BLAST2 are almost as good. (B) PDB90D-B database. The quick WU-BLAST2 program provides the best coverage
at 1% EPQ on this database, although at higher levels of error it becomes slightly worse than FASTA ktup = 1 and SSEARCH.

The availability of genome-scale DNA sequence information and reagents has radically altered life-science
research. This revolution has led to the development of a new scientific subdiscipline derived from a combination
of the fields of toxicology and genomics. This subdiscipline, termed toxicogenomics, is concerned with the
identification of potential human and environmental toxicants, and their putative mechanisms of action, through
the use of genomics resources. One such resource is DNA microarrays or "chips," which allow the monitoring of
the expression levels of thousands of genes simultaneously. Here we propose a general method by which gene
expression, as measured by cDNA microarrays, can be used as a highly sensitive and informative marker for
toxicity. Our purpose is to acquaint the reader with the development and current state of microarray technology
and to present our view of the usefulness of microarrays to the field of toxicology. Mol. Carcinog. 24:153-159,
1999. © 1999 Wiley-Liss, Inc.

INTRODUCTION

Technological advancements combined with intensive
DNA sequencing efforts have generated an
enormous database of sequence information over the
past decade. To date, more than 3 million sequences,
totaling over 2.2 billion bases [1], are contained
within the GenBank database, which includes the
complete sequences of 19 different organisms [2]. The
first complete sequence of a free-living organism,
Haemophilus influenzae, was reported in 1995 [3] and
was followed shortly thereafter by the first complete
sequence of a eukaryote, Saccharomyces cerevisiae [4].
The development of dramatically improved sequencing
methodologies promises that complete elucidation
of the Homo sapiens DNA sequence is not far
behind [5].

To exploit more fully the wealth of new sequence
information, it was necessary to develop novel methods
for the high-throughput or parallel monitoring
of gene expression. Established methods such as
northern blotting, RNase protection assays, S1 nuclease
analysis, plaque hybridization, and slot blots
do not provide sufficient throughput to effectively
utilize the new genomics resources. Newer methods
such as differential display [6], high-density filter
hybridization [7,8], serial analysis of gene expression
[9], and cDNA- and oligonucleotide-based microarray
"chip" hybridization [10-12] are possible solutions
to this bottleneck. It is our belief that the microarray
approach, which allows the monitoring of expression
levels of thousands of genes simultaneously, is
a tool of unprecedented power for use in toxicology
studies.



Almost without exception, gene expression is al- 
tered during toxicity, as either a direct or indirect 
result of toxicant exposure. The challenge facing 
toxicologists is to define, under a given set of ex- 
perimental conditions, the characteristic and spe- 
cific pattern of gene expression elicited by a given 
toxicant. Microarray technology offers an ideal platform
for this type of analysis and could be the foundation
for a fundamentally new approach to
toxicology testing.

MICROARRAY DEVELOPMENT AND APPLICATIONS 

cDNA Microarrays 

In the past several years, numerous systems were
developed for the construction of large-scale DNA
arrays. All of these platforms are based on cDNAs
or oligonucleotides immobilized to a solid support.
In the cDNA approach, cDNA (or genomic)
clones of interest are arrayed in a multi-well format
and amplified by polymerase chain reaction.
The products of this amplification, which are usually
500- to 2000-bp clones from the 3' regions of
the genes of interest, are then spotted onto solid
support by using high-speed robotics. By using
this method, microarrays of up to 10,000 clones
can be generated by spotting onto a glass substrate
[13,14]. Sample detection for microarrays on glass
involves the use of probes labeled with fluorescent
or radioactive nucleotides.

Fluorescent cDNA probes are generated from control
and test RNA samples in single-round reverse-transcription
reactions in the presence of fluorescently
tagged dUTP (e.g., Cy3-dUTP and Cy5-dUTP), which
produces control and test products labeled with different
fluors. The cDNAs generated from these two
populations, collectively termed the "probe," are then
mixed and hybridized to the array under a glass coverslip
[10,11,15]. The fluorescent signal is detected
by using a custom-designed scanning confocal microscope
equipped with a motorized stage and lasers
for fluor excitation [10,11,15]. The data are analyzed
with custom digital image analysis software that determines
for each DNA feature the ratio of fluor 1 to
fluor 2, corrected for local background [16,17]. The
strength of this approach lies in the ability to label
RNAs from control and treated samples with different
fluorescent nucleotides, allowing for the simultaneous
hybridization and detection of both
populations on one microarray. This method eliminates
the need to control for hybridization between
arrays. The research groups of Drs. Patrick Brown and
Ron Davis at Stanford University spearheaded the
effort to develop this approach, which has been successfully
applied to studies of Arabidopsis thaliana
RNA [10], yeast genomic DNA [15], tumorigenic versus
non-tumorigenic human tumor cell lines [11],
human T-cells [18], yeast RNA [19], and human inflammatory
disease-related genes [20]. The most dramatic
result of this effort was the first published
account of gene expression of an entire genome, that
of the yeast Saccharomyces cerevisiae [21].
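
The per-feature quantity described above, the ratio of fluor 1 to fluor 2 corrected for local background, might be computed along the following lines. This is a rough sketch, not the cited image-analysis software: the function names, the assignment of test and control to particular fluors, and the floor used to avoid division by zero are all assumptions.

```python
import math


def background_corrected_ratio(
    test_signal: float,
    test_background: float,
    control_signal: float,
    control_background: float,
    floor: float = 1.0,
) -> float:
    """Ratio of test to control intensity after subtracting local background."""
    # Clamp at a small positive floor so weak spots cannot produce a zero
    # or negative denominator (an assumption, not taken from the text).
    test = max(test_signal - test_background, floor)
    control = max(control_signal - control_background, floor)
    return test / control


def log2_ratio(ratio: float) -> float:
    """Log scale makes 2-fold induction (+1) and 2-fold suppression (-1) symmetric."""
    return math.log2(ratio)


ratio = background_corrected_ratio(1100.0, 100.0, 600.0, 100.0)
print(ratio)              # 2.0
print(log2_ratio(ratio))  # 1.0
```

Because both populations are hybridized to the same feature, the ratio is taken within one array, which is the property the text credits with removing the need to control for hybridization between arrays.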

In an alternative approach, large numbers of cDNA
clones can be spotted onto a membrane support, albeit
at a lower density [7,22]. This method is useful
for expression profiling and large-scale screening and
mapping of genomic or cDNA clones [7,22-24]. In
expression profiling on filter membranes, two different
membranes are used simultaneously for control
and test RNA hybridizations, or a single
membrane is stripped and reprobed. The signal is
detected by using radioactive nucleotides and visualized
by phosphorimager analysis or autoradiography.
Numerous companies now sell such cDNA
membranes and software to analyze the image data
[25-27].

Oligonucleotide Microarrays 

Oligonucleotide microarrays are constructed either
by spotting prefabricated oligos on a glass support
[13] or by the more elegant method of direct in situ
oligo synthesis on the glass surface by photolithography
[28-30]. The strength of this approach lies in
its ability to discriminate DNA molecules based on
single base-pair differences. This allows the application
of this method to the fields of medical diagnostics,
pharmacogenetics, and sequencing by hybridization
as well as gene-expression analysis.

Fabrication of oligonucleotide chips by photolithography
is theoretically simple but technically
complex [29,30]. The light from a high-intensity
mercury lamp is directed through a photolithographic
mask onto the silica surface, resulting in
deprotection of the terminal nucleotides in the illuminated
regions. The entire chip is then reacted with
the desired free nucleotide, resulting in selected chain
elongation. This process requires only 4n cycles
(where n = oligonucleotide length in bases) to synthesize
a vast number of unique oligos, the total number
of which is limited only by the complexity of the
photolithographic mask and the chip size [29,31,32].
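
The 4n-cycle bound can be illustrated with a toy simulation in which each cycle "illuminates" exactly those features whose next base is the nucleotide being coupled. This is a schematic sketch only: it models the masking logic, not the chemistry, and the function name and fixed A, C, G, T cycle order are assumptions.

```python
from itertools import product
from typing import List, Tuple

BASES = "ACGT"


def synthesize(targets: List[str]) -> Tuple[List[str], int]:
    """Grow one oligo per feature; return the finished chains and cycle count."""
    lengths = {len(t) for t in targets}
    if len(lengths) != 1:
        raise ValueError("all target oligos must have the same length")
    n = lengths.pop()
    chains = [""] * len(targets)
    cycles = 0
    for position in range(n):
        for base in BASES:
            cycles += 1  # one mask exposure plus one coupling step
            for feature, target in enumerate(targets):
                # The mask deprotects only features that need `base` next.
                if target[position] == base:
                    chains[feature] += base
    return chains, cycles


all_dimers = ["".join(p) for p in product(BASES, repeat=2)]
chains, cycles = synthesize(all_dimers)
print(cycles, len(chains))  # 8 16
```

With n = 2, eight cycles yield all 16 possible dimers; the number of distinct sequences grows as 4^n while the cycle count grows only as 4n.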

Sample preparation involves the generation of
double-stranded cDNA from cellular poly(A)+ RNA
followed by antisense RNA synthesis in an in vitro
transcription reaction with biotinylated or fluor-tagged
nucleotides. The RNA probe is then fragmented
to facilitate hybridization. If the indirect
visualization method is used, the chips are incubated
with fluor-linked streptavidin (e.g., phycoerythrin)
after hybridization [12,33]. The signal is detected with
a custom confocal scanner [34]. This method has
been applied successfully to the mapping of genomic
library clones [35], to de novo sequencing by hybridization
[28,36], and to evolutionary sequence comparison
of the BRCA1 gene [37]. In addition,
mutations in the cystic fibrosis [38] and BRCA1 [39]
gene products and polymorphisms in the human immunodeficiency
virus-1 clade B protease gene [40]
have been detected by this method. Oligonucleotide
chips are also useful for expression monitoring [33],
as has been demonstrated by the simultaneous evaluation
of gene-expression patterns in nearly all open
reading frames of the yeast strain S. cerevisiae [12].
More recently, oligonucleotide chips have been used
to help identify single nucleotide polymorphisms in
the human [41] and yeast [42] genomes.

THE USE OF MICROARRAYS IN TOXICOLOGY 

Screening for Mechanism of Action 

The field of toxicology uses numerous in vivo
model systems, including the rat, mouse, and rabbit,
to assess potential toxicity, and these bioassays
are the mainstay of toxicology testing. However, in
the past several decades, a plethora of in vitro techniques
have been developed to measure toxicity,
many of which measure toxicant-induced DNA damage.
Examples of these assays include the Ames test,
the Syrian hamster embryo cell transformation assay,
micronucleus assays, measurements of sister
chromatid exchange and unscheduled DNA synthesis,
and many others. Fundamental to all of these
methods is the fact that toxicity is often preceded
by, and results in, alterations in gene expression. In
many cases, these changes in gene expression are a
far more sensitive, characteristic, and measurable
endpoint than the toxicity itself. We therefore propose
that a method based on measurements of the
genome-wide gene expression pattern of an organism
after toxicant exposure is fundamentally informative
and complements the established methods
described above.

We are developing a method by which toxicants
can be identified and their putative mechanisms of
action determined by using toxicant-induced gene expression
profiles. In this method, in one or more defined
model systems, dose and time-course parameters
are established for a series of toxicants within a given
prototypic class (e.g., polycyclic aromatic hydrocarbons
(PAHs)). Cells are then treated with these agents
at a fixed toxicity level (as measured by cell survival),
RNA is harvested, and toxicant-induced gene expression
changes are assessed by hybridization to a cDNA
microarray chip (Figure 1). We have developed a custom
DNA chip, called ToxChip v1.0, specifically for
this purpose and will discuss it in more detail below.
The changes in gene expression induced by the test
agents in the model systems are analyzed, and the
common set of changes unique to that class of toxicants,
termed a toxicant signature, is determined.
This signature is derived by ranking across all experiments
the gene-expression data based on relative
fold induction or suppression of genes in treated
samples versus untreated controls and selecting the
most consistently different signals across the sample
set. A different signature may be established for each
prototypic toxicant class. Once the signatures are determined,
gene-expression profiles induced by unknown
agents in these same model systems can then
be compared with the established signatures. A match
assigns a putative mechanism of action to the test
compound. Figure 2 illustrates this signature method
for different types of oxidant stressors, PAHs, and
peroxisome proliferators. In this example, the unknown
compound in question had a gene-expression
profile similar to that of the oxidant stressors in
the database. We anticipate that this general method
will also reveal cross talk between different pathways
induced by a single agent (e.g., reveal that a compound
has both PAH-like and oxidant-like properties).
In the future, it may be necessary to distinguish
very subtle differences between compounds within
a very large sample set (e.g., thousands of highly similar
structural isomers in a combinatorial chemistry
library or peptide library). To generate these highly
refined signatures, standard statistical clustering techniques
or principal-component analysis can be used.
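
One simple way to make the signature idea concrete is sketched below. It is not the authors' procedure: the consistency rule (same direction in every experiment, ranked by the smallest absolute log2 fold change), the 2-fold cutoff, the matching score, and the gene names are all hypothetical choices made for illustration.

```python
from typing import Dict, List

Profile = Dict[str, float]  # gene -> log2(treated / untreated control)


def consistency(values: List[float]) -> float:
    """Signed smallest fold change if all experiments agree in direction, else 0."""
    if all(v > 0 for v in values):
        return min(values)
    if all(v < 0 for v in values):
        return max(values)
    return 0.0


def derive_signature(profiles: List[Profile], min_abs_log2: float = 1.0) -> Dict[str, int]:
    """Map each consistently changed gene to +1 (induced) or -1 (suppressed)."""
    genes = set.intersection(*(set(p) for p in profiles))
    scored = {g: consistency([p[g] for p in profiles]) for g in genes}
    # Rank genes by how strongly and consistently they change.
    ranked = sorted(scored, key=lambda g: abs(scored[g]), reverse=True)
    return {
        g: (1 if scored[g] > 0 else -1)
        for g in ranked
        if abs(scored[g]) >= min_abs_log2
    }


def match_score(unknown: Profile, signature: Dict[str, int], min_abs_log2: float = 1.0) -> float:
    """Fraction of signature genes changed in the expected direction."""
    if not signature:
        return 0.0
    hits = sum(
        1
        for gene, direction in signature.items()
        if unknown.get(gene, 0.0) * direction >= min_abs_log2
    )
    return hits / len(signature)


# Two made-up experiments with toxicants of one class.
class_profiles = [
    {"geneA": 2.5, "geneB": 1.2, "geneC": 0.1, "geneD": -2.0},
    {"geneA": 3.0, "geneB": -1.5, "geneC": 0.2, "geneD": -1.4},
]
signature = derive_signature(class_profiles)
print(signature)  # {'geneA': 1, 'geneD': -1}
print(match_score({"geneA": 2.0, "geneD": -1.8}, signature))  # 1.0
```

A larger compound set would more plausibly call for the clustering or principal-component methods mentioned in the text than for this all-or-nothing rule.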

For the studies outlined in Figure 2, we developed
the custom cDNA microarray chip ToxChip v1.0.

Figure 1. Simplified overview of the method for sample
preparation and hybridization to cDNA microarrays. For illustrative
purposes, samples derived from cell culture are depicted,
although other sample types are amenable to this analysis.

Figure 2. Schematic representation of the method for identification
of a toxicant's mechanism of action. In this method,
gene-expression data derived from exposure of model systems
to known toxicants are analyzed, and a set of changes
characteristic to that type of toxicant (termed the toxicant
signature) is identified. As depicted, oxidant stressors produce
consistent changes in group A genes (indicated by red and
green circles), but not group B or C genes (indicated by gray
circles). The set of gene-expression changes elicited by the
suspected toxicant is then compared with these characteristic
patterns, and a putative mechanism of action is assigned to
the unknown agent.



The 2090 human genes that comprise this subarray
were selected for their well-documented involvement
in basic cellular processes as well as their responses
to different types of toxic insult. Included
on this list are DNA replication and repair genes,
apoptosis genes, and genes responsive to PAHs and
dioxin-like compounds, peroxisome proliferators,
estrogenic compounds, and oxidant stress. Some of
the other categories of genes include transcription
factors, oncogenes, tumor suppressor genes, cyclins,
kinases, phosphatases, cell adhesion and motility
genes, and homeobox genes. Also included in this
group are 84 housekeeping genes, whose hybridization
intensity is averaged and used for signal normalization
of the other genes on the chip. To date,
very few toxicants have been shown to have appreciable
effects on the expression of these housekeeping
genes. However, this housekeeping list will be
revised if new data warrant the addition or deletion
of a particular gene. Table 1 contains a general description
of some of the different classes of genes
that comprise ToxChip v1.0.
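
The housekeeping normalization mentioned above amounts to dividing every intensity by the mean intensity of the housekeeping set. The sketch below assumes an arithmetic mean over whichever housekeeping genes are present; the function and gene names are hypothetical.

```python
from typing import Dict, Iterable


def normalize_to_housekeeping(
    intensities: Dict[str, float],
    housekeeping: Iterable[str],
) -> Dict[str, float]:
    """Scale all intensities so the housekeeping genes average to 1.0."""
    present = [intensities[g] for g in housekeeping if g in intensities]
    if not present:
        raise ValueError("no housekeeping genes found on this array")
    mean_hk = sum(present) / len(present)
    return {gene: value / mean_hk for gene, value in intensities.items()}


# Toy array with two housekeeping genes and one gene of interest.
raw = {"hk1": 100.0, "hk2": 300.0, "geneX": 400.0}
scaled = normalize_to_housekeeping(raw, ["hk1", "hk2"])
print(scaled["geneX"])  # 2.0
```

Normalizing each array this way puts intensities from different hybridizations on a common scale, provided the housekeeping genes themselves are unaffected by the treatment, which is the premise stated in the text.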

When a toxicant signature is determined, the
genes within this signature are flagged within the
database. When uncharacterized toxicants are then
screened, the data can be quickly reformatted so that
blocks of genes representing the different signatures
are displayed [11]. This facilitates rapid, visual interpretation
of data. We are also developing ToxChip
v2.0 and chips for other model systems,
including rat, mouse, Xenopus, and yeast, for use in
toxicology studies.

Animal Models in Toxicology Testing 

The toxicology community relies heavily on the
use of animals as model systems for toxicology testing.
Unfortunately, these assays are inherently expensive,
require large numbers of animals, and take a
long time to complete and analyze. Therefore, the
National Institute of Environmental Health Sciences
(NIEHS), the National Toxicology Program, and the
toxicology community at large are committed to reducing
the number of animals used, by developing
more efficient and alternative testing methodologies.
Although substantial progress has been made in the
development of alternative methods, bioassays are
still used for testing endpoints such as neurotoxicity,
immunotoxicity, reproductive and developmental
toxicology, and genetic toxicology. The rodent
cancer bioassay is a particularly expensive and time-consuming
assay, as it requires almost 4 yr, 1200
animals, and millions of dollars to execute and analyze
[43]. In vitro experiments of the type outlined
in Figure 2 might provide evidence that an unknown

Table 1. ToxChip v1.0: A Human cDNA Microarray
Chip Designed to Detect Responses to Toxic Insult

Gene category                           No. of genes on chip

Apoptosis                                72
DNA replication and repair               99
Oxidative stress/redox homeostasis       90
Peroxisome proliferator responsive       22
Dioxin/PAH responsive                    12
Estrogen responsive                      63
Housekeeping                             84
Oncogenes and tumor suppressor genes     76
Cell-cycle control                       51
Transcription factors                   131
Kinases                                 276
Phosphatases                             88
Heat-shock proteins                      23
Receptors                               349
Cytochrome P450s                         30

*This list is intended as a general guide. The gene categories are not
unique, and some genes are listed in multiple categories.

agent is (or is not) responsible for eliciting a given
biological response. This information would help to
select a bioassay more specifically suited to the agent
in question or perhaps suggest that a bioassay is not
necessary, which would dramatically reduce cost,
animal use, and time.

The addition of microarray techniques to stan- 
dard bioassays may dramatically enhance the sen- 
sitivity and interpretability of the bioassay and 
possibly reduce its cost. Gene-expression signatures 
could be determined for various types of tissue-spe- 
cific toxicants, and new compounds could be 
screened for these characteristic signatures, providing
a rapid and sensitive in vivo test. Also, because
gene expression is often exquisitely sensitive to low
doses of a toxicant, the combination of gene-expression
screening and the bioassay might allow the use
of lower toxicant doses, which are more relevant to
human exposure levels, and the use of fewer animals.
In addition, gene-expression changes are normally
measured in hours or days, not in the months
to years required for tumor development. Furthermore,
microarrays might be particularly useful for
investigating the relationship between acute and
chronic toxicity and identifying secondary effects
of a given toxicant by studying the relationship
between the duration of exposure to a toxicant and 
the gene-expression profile produced. Thus, a bio- 
assay that incorporates gene-expression signatures 
with traditional endpoints might be substantially 
shorter, use more realistic dose regimens, and cost 
substantially less than the current assays do. 

These considerations are also relevant for branches
of toxicology not related to human health and not
using rodents as model systems, such as aquatic toxicology
and plant pathology. Bioassays based on the
fathead minnow, Daphnia, and Arabidopsis could
also be improved by the addition of microarray analysis.
The combination of microarrays with traditional
bioassays might also be useful for investigating some
of the more intractable problems in toxicology research,
such as the effects of complex mixtures and
the difficulties in cross-species extrapolation.

Exposure Assessment, Environmental Monitoring, 
and Drug Safety 

The currently used methods for assessment of exposure
to chemical toxicants are based on measurement
of tissue toxin levels or on surrogate markers
of toxicity, termed biomarkers (e.g., peripheral blood
levels of hepatic enzymes or DNA adducts). Because
gene expression is a sensitive endpoint, gene expression
as measured with microarray technology may
be useful as a new biomarker to more precisely identify
hazards and to assess exposure. Similarly,
microarrays could be used in an environmental- 
monitoring capacity to measure the effect of poten- 
tial contaminants on the gene-expression profiles 
of resident organisms. In an analogous fashion, 
microarrays could be used to measure gene-expres- 
sion endpoints in subjects in clinical trials. The com- 
bination of these gene-expression data and more 
established toxic endpoints in these trials could be 
used to define highly precise surrogates of safety. 

Gene-expression profiles in samples from exposed 
individuals could be compared to the profiles of the 
same individuals before exposure. From this infor- 
mation, the nature of the toxic exposure can be de- 
termined or a relative clinical safety factor estimated. 
In the future it may also be possible to estimate not 
only the nature but the dose of the toxicant for a 
given exposure, based on relative gene-expression 
levels. This general approach may be particularly 
appropriate for occupational-health applications, in 
which unexposed and exposed samples from the 
same individuals may be obtainable. For example, 
a pilot study of gene expression in peripheral-blood
lymphocytes of Polish coke-oven workers exposed
to PAHs (and many other compounds) is under con-
sideration at the NIEHS. An important consideration
for these types of studies is that gene expression can 
be affected by numerous factors, including diet, 
health, and personal habits. To reduce the effects 
of these confounding factors, it may be necessary 
to compare pools of control samples with pools of 
treated samples. In the future it may be possible to 
compare exposed sample sets to a national database 
of human-expression data, thus eliminating the 
need to provide an unexposed sample from the same 
individual. Efforts to develop such a national gene-
expression database are currently under way [44,45].
However, this national database approach will re- 
quire a better understanding of genome-wide gene 
expression across the highly diverse human popu- 
lation and of the effects of environmental factors 
on this expression. 
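
The before-and-after comparison described above can be illustrated with a minimal sketch. It assumes that each profile is a mapping from gene name to a background-corrected intensity; the function names, the intensity floor, and the 2-fold cutoff are illustrative assumptions, not part of the original discussion.

```python
import math


def log2_fold_changes(before, after, floor=1.0):
    """Per-gene log2(after / before) for samples from the same individual.

    `floor` is an assumed lower bound on intensity that avoids taking the
    logarithm of zero for genes with no detectable signal.
    """
    changes = {}
    for gene in before.keys() & after.keys():
        b = max(before[gene], floor)
        a = max(after[gene], floor)
        changes[gene] = math.log2(a / b)
    return changes


def responsive_genes(changes, min_abs_log2=1.0):
    """Genes whose expression changed by at least 2**min_abs_log2-fold."""
    return sorted(g for g, c in changes.items() if abs(c) >= min_abs_log2)


# Hypothetical intensities for one individual before and after exposure.
before = {"CYP1A1": 100.0, "ACTB": 1000.0}
after = {"CYP1A1": 800.0, "ACTB": 1100.0}
changes = log2_fold_changes(before, after)
print(responsive_genes(changes))
```

Because diet, health, and personal habits also affect expression, a real analysis would probably apply the same calculation to pooled samples rather than to a single pair.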

Alleles, Oligo Arrays, and Toxicogenetics

Gene sequences vary between individuals, and
this variability can be a causative factor in human
diseases of environmental origin [46,47]. A new area
of toxicology, termed toxicogenetics, was recently
developed to study the relationship between genetic
variability and toxicant susceptibility. This field is
not the subject of this discussion, but it is worth-
while to note that the ability of oligonucleotide ar-
rays to discriminate DNA molecules based on single
base-pair differences makes these arrays uniquely
useful for this type of analysis. Recent reports dem-
onstrated the feasibility of this approach [41,42].
The NIEHS has initiated the Environmental Genome
Project to identify common sequence polymor-
phisms in 200 genes thought to be involved in en-
vironmental diseases [48]. In a pilot study on the
feasibility of this application to the Environmental
Genome Project, oligonucleotide arrays will be used
to resequence 20 candidate genes. This toxicogenetic
approach promises to dramatically improve our un-
derstanding of interindividual variability in disease
susceptibility.

FUTURE PRIORITIES 
There are many issues that must be addressed be-
fore the full potential of microarrays in toxicology
research can be realized. Among these are model sys-
tem selection, dose selection, and the temporal na-
ture of gene expression. In other words, in which
species, at what dose, and at what time do we look
for toxicant-induced gene expression? If human
samples are analyzed, how variable is global gene
expression between individuals, before and after toxi-
cant exposure? What are the effects of age, diet, and
other factors on this expression? Experience, in the
form of large data sets of toxicant exposures, will
answer these questions.

One of the most pressing issues for array scientists
is the construction of a national public database
(linked to the existing public databases) to serve as a
repository for gene-expression data. This relational
database must be made available for public use, and
researchers must be encouraged to submit their ex-
pression data so that others may view and query the
data. Researchers at the National Institutes
of Health have made laudable progress in develop-
ing the first generation of such a database [44,45]. In
addition, improved statistical methods for gene clus-
tering and pattern recognition are needed to ana-
lyze the data in such a public database.

The proliferation of different platforms and meth-
ods for microarray hybridizations will improve
sample handling and data collection and analysis and
reduce costs. However, the variety of microarray
methods available will create problems of data com-
patibility between platforms. In addition, the near-
infinite variety of experimental conditions under
which data will be collected by different laboratories
[...] difficult. To help circumvent these future problems, a
set of standards to be included on all platforms
should be established. These standards would facili-
tate data entry into the national database and serve
as reference points for cross-platform and interlabo-
ratory data analysis.

Many issues remain to be resolved, but it is clear
that new molecular techniques such as microarray
hybridization will have a dramatic impact on toxicol-
ogy research. In the future, the information gathered
from microarray-based hybridization experiments will
form the basis for an improved method to assess the
impact of chemicals on human and environmental

Application of DNA Arrays to Toxicology

John C. Rockett and David J. Dix 



Reproductive Toxicology Division, National Health and Environmental Effects Research Laboratory, U.S. Environmental Protection
Agency, Research Triangle Park, North Carolina, USA



DNA array technology makes it possible to rapidly genotype individuals or quantify the expression
of thousands of genes on a single filter or glass slide, and holds enormous potential in toxicologic
applications. This potential led to a U.S. Environmental Protection Agency-sponsored workshop
titled "Application of Microarrays to Toxicology" on 7-8 January 1999 in Research Triangle Park,
North Carolina. In addition to providing state-of-the-art information on the application of DNA or
gene microarrays, the workshop catalyzed the formation of several collaborations, committees, and
user's groups throughout the Research Triangle Park area and beyond. Potential applications of
microarrays to toxicologic research and risk assessment include genome-wide expression analyses to
identify gene-expression networks and toxicant-specific signatures that can be used to define mode
of action, for exposure assessment, and for environmental monitoring. Arrays may also prove useful
for monitoring genetic variability and its relationship to toxicant susceptibility in human popula-
tions. Key words: DNA arrays, gene arrays, microarrays, toxicology. Environ Health Perspect
107:681-685 (1999). [Online 6 July 1999]
http://ehpnet1.niehs.nih.gov/docs/1999/107p681-685rockett/



Decoding the genetic blueprint is a dream that
offers manifold returns in terms of understand-
ing how organisms develop and function in an
often hostile environment. With the rapid
advances in molecular biology over the last 30
years, the dream has come a step closer to reali-
ty. Molecular biologists now have the ability to
elucidate the composition of any genome.
Indeed, almost 20 genomes have already been
sequenced and more than 60 are currently
under way. Foremost among these is the
Human Genome Mapping Project. However,
the genomes of a number of commonly used
laboratory species are also under intensive
investigation, including yeast, Arabidopsis,
maize, rice, zebra fish, mouse, rat, and dog. It
is widely expected that the completion of such
programs will facilitate the development of
many powerful new techniques and approach-
es to diagnosing and treating genetically and
environmentally induced diseases which afflict
mankind. However, the vast amount of data
being generated by genome mapping will
require new high-throughput technologies to
investigate the function of the millions of new
genes that are being reported. Among the most
widely heralded of the new functional
genomics technologies are DNA arrays, which
represent perhaps the most anticipated new
molecular biology technique since polymerase
chain reaction (PCR).

Arrays enable the study of literally thou-
sands of genes in a single experiment. The
potential importance of arrays is enormous and
has been highlighted by the recent publication
of an entire Nature Genetics supplement dedi-
cated to the technology (1). Despite this huge
surge of interest, DNA arrays are still little used
and largely unproven, as demonstrated by the
high ratio of review and press articles to actual
data papers. Even so, the potential they offer
has driven venture capitalists into a frenzy of
investment and many new companies are
springing up to claim a share of this rapidly
developing market.

The U.S. Environmental Protection
Agency (EPA) is interested in applying DNA
array technology to ongoing toxicologic stud-
ies. To learn more about the current state of
the technology, the Reproductive Toxicology
Division (RTD) of the National Health and
Environmental Effects Research Laboratory
(NHEERL; Research Triangle Park, NC)
hosted a workshop on "Application of
Microarrays to Toxicology" on 7-8 January
1999 in Research Triangle Park, North
Carolina. The workshop was organized by
David Dix, Robert Kavlock, and John Rockett
of the RTD/NHEERL. Twenty-two intra-
mural and extramural scientists from govern-
ment, academia, and industry shared informa-
tion, data, and opinions on the current and
future applications for this exciting new tech-
nology. The workshop had more than 150
attendees, including researchers, students, and
administrators from the EPA, the National
Institute of Environmental Health Sciences
(NIEHS), and a number of other establish-
ments from Research Triangle Park and
beyond. Presentations ranged from the tech-
nology behind array production through the
sharing of actual experimental data and projec-
tions on the future importance and applica-
tions of arrays. The information contained in
the workshop presentations should provide aid
and insight into arrays in general and their
application to toxicology in particular.

Array Elements 

In the context of molecular biology, the word
"array" is normally used to refer to a series of
DNA or protein elements firmly attached in
a regular pattern to some kind of supportive
medium. DNA array is often used inter-
changeably with gene array or microarray.
Although not formally defined, microarray is
generally used to describe the higher density
arrays typically printed on glass chips. The
DNA elements that make up DNA arrays
can be oligonucleotides, partial gene
sequences, or full-length cDNAs. Companies
offering premade arrays that contain less
than full-length clones normally use regions
of the genes which are specific to that gene to
prevent false positives arising through cross-
hybridization. Sequence verification of
cDNA clone identity is necessary because of
errors in identifying specific clones from
cDNA libraries and databases. Premade
DNA arrays printed on membranes are cur-
rently or imminently available for human,
mouse, and rat. In most cases they contain
DNA sequences representing several thou-
sand different sequence clusters or genes as
delineated through the National Center for
Biotechnology Information UniGene Project
(2). Many of these different UniGene clusters
(putative genes) are represented only by
expressed sequence tags (ESTs).

Array Printing 

Arrays are typically printed on one of two
types of support matrix. Nylon membranes
are used by most off-the-shelf array providers
such as Clontech Laboratories, Inc.
(Palo Alto, CA), Genome Systems, Inc. (St.
Louis, MO), and Research Genetics, Inc.
(Huntsville, AL). Microarrays such as those
produced by Affymetrix, Inc. (Santa Clara,
CA), Incyte Pharmaceuticals, Inc. (Palo Alto,
CA), and many do-it-yourself (DIY) arraying
groups use glass wafers or slides. Although
standard microscope slides may be used, they
must be preprepared to facilitate sticking
of the DNA to the glass. Several different
coatings have been successfully used, includ-
ing silane and lysine. The coating of slides
can easily be carried out in the laboratory,
but many prefer the convenience of precoated
slides available from suppliers.

Once the support matrix has been pre-
pared, the DNA elements can be applied by
several methods. Affymetrix, Inc. has devel-
oped a unique photolithographic technology
for attaching oligonucleotides to glass wafers.
More commonly, DNA is applied by either
noncontact or contact printing. Noncontact
printers can use thermal, solenoid, or piezoelec-
tric technology to spray aliquots of solution
onto the support matrix and may be used to
produce slide or membrane-based arrays.
Cartesian Technologies, Inc. (Irvine, CA) has
developed nQUAD technology for use in its
PixSys printers. The system couples a syringe
pump with the microsolenoid valve, a combi-
nation that provides rapid quantitative dispens-
ing of nanoliter volumes (down to 4.2 nL) over
a variable volume range. A different approach
to noncontact printing uses a solid pin and ring
combination (Genetic MicroSystems, Inc.,
Woburn, MA). This system (Figure 1) allows a
broader range of samples, including cell suspen-
sions and particulates, because the printing
head cannot be blocked up in the same way as
a spray nozzle. Fluid transfer is controlled in
this system primarily by the pin dimensions
and the force of deposition, although the
nature of the support matrix and the sample
will also affect transfer to some degree.

In contact printing, the pin head is dipped
in the sample and then touched to the support
matrix to deposit a small aliquot. Split pins
were one of the first contact-printing devices
to be reported and are the suggested format
for DIY arrayers, as described by Brown (3).
Split pins are small metal pins with a precise
groove cut vertically in the middle of the pin
tip. In this system, 1-48 split pins are posi-
tioned in the pin head. The split pins work by
simple capillary action, not unlike a fountain
pen: when the pin heads are dipped in the
sample, liquid is drawn into the pin groove. A
small (fixed) volume is then deposited each
time the split pins are gently touched to
the support matrix. Sample (100-500 pL
depending on a variety of parameters) can be
deposited on multiple slides before refilling is
required, and array densities of > 2,500
spots/cm² may be produced. The deposit vol-
ume depends on the split size, sample fluidi-
ty, and the speed of printing. Split pins are
relatively simple to produce and can be made
in-house if a suitable machine shop is avail-
able. Alternatively, they can be obtained
directly from companies such as TeleChem
International, Inc. (Sunnyvale, CA).

Irrespective of their source, printers
should be run through a preprint sequence
prior to producing the actual experimental
arrays: the first 100 or so spots of a new run
tend to be somewhat variable. Factors affect-
ing spot reproducibility include slide treat-
ment homogeneity, sample differences, and
instrument errors. Other factors that come
into play include clean ejection of the drop
and clogging (nQUAD printing) and
mechanical variations and long-term alter-
ation in print-head surface of solid and split
pins. However, with careful preparation it is
possible to get a coefficient of variance for
spot reproducibility below 10%.
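
The coefficient of variation mentioned above is the sample standard deviation of replicate spot intensities expressed as a percentage of their mean. A minimal sketch, assuming the replicate intensities have already been extracted from an image (the function name and the example values are hypothetical):

```python
import statistics


def coefficient_of_variation(intensities):
    """Sample standard deviation as a percentage of the mean."""
    mean = statistics.mean(intensities)
    if mean == 0:
        raise ValueError("mean intensity is zero; CV is undefined")
    return 100.0 * statistics.stdev(intensities) / mean


# Hypothetical intensities for five replicate spots of the same sample.
replicates = [100.0, 110.0, 90.0, 105.0, 95.0]
cv = coefficient_of_variation(replicates)
print(f"CV = {cv:.1f}%")
```

A run could then be accepted or rejected by comparing this value with the 10% figure quoted above.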

One potential printing problem is sample 
carryover. Repeated washing, blotting, and 
drying (vacuum) of print pins between samples 
is normally effective at reducing sample carry- 
over to negligible amounts. Printing should 
also be carried out in a controlled environ- 
ment. Humidified chambers are available in 
which to place printers. These help prevent 
dust contamination and produce a uniform 
drying rate, which is important in determining 
spot size, quality, and reproducibility. 

In summary, although several printing 
technologies are available, none are par- 
ticularly outstanding and the bottom line 
is that they are still in a relatively early stage
of evolution. 

Array Hybridization 

The hybridization protocol is, practically
speaking, relatively straightforward and those
with previous experience in blotting should
have little difficulty. Array hybridizations
are, in essence, reverse Southern/Northern
blots: instead of applying a labeled probe to
the target population of DNA/RNA, the
labeled population is applied to the probe(s).
With membrane-based arrays, the control and
treated mRNA populations are normally con-
verted to cDNA and labeled with isotope (e.g.,
a phosphorus radioisotope) in the process. These labeled populations
are then hybridized independently to parallel
or serial arrays and the hybridization signal is
detected with a phosphorimager. A less com-
monly used alternative to radioactive probes is
enzymatic detection. The probe may be
biotinylated, haptenylated, or have alkaline
phosphatase/horseradish peroxidase attached.
Hybridization is detected by enzymatic reac-
tion yielding a color reaction (4). Differences
in hybridization signals can be detected by eye
or, more accurately, with the help of digital
imaging and commercially available software.
The labeling of the test populations for slide-
based microarrays uses a slightly different
approach. The probe typically consists of two
samples of polyA+ RNA (usually from a treated
and a control population) that are converted to
cDNA; in the process each is labeled with a
different fluor. The independently labeled
probes are then mixed together and hybridized
to a single microarray slide and the resulting
combined fluorescent signal is scanned. After




Figure 1. Genetic MicroSystems (Woburn, MA) pin
and ring system for printing arrays. The pin and ring com-
bination consists of a circular open ring oriented
parallel to the sample solution, with a vertical pin
centered over the ring. When the ring is dipped
into a solution and lifted, it withdraws an aliquot
of sample held by surface tension. To spot the
sample, the pin is driven down through the ring
and a portion of the solution is transferred to the
bottom of the pin. The pin continues to move
downward until the pendant drop of solution
makes contact with the underlying surface. The
pin is then lifted, and gravity and surface tension
cause deposition of the spot onto the array.
Figure from Flowers et al. (14), with permission
from Genetic MicroSystems.

normalization, it is possible to determine the 
ratio of fluorescent signals from a single 
hybridization of a slide-based microarray.
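
The ratio calculation for a two-color slide can be sketched as follows. The text does not say which normalization the workshop participants used; this example assumes global median centering of log ratios, which presumes that most genes on the array are unchanged. The function name, the intensity floor, and the spot values are illustrative.

```python
import math
import statistics


def normalized_log_ratios(treated, control, floor=1.0):
    """log2(treated / control) per spot, centered on the array-wide median.

    `treated` and `control` map spot names to the intensities measured in
    the two fluorescent channels of a single hybridization.
    """
    spots = sorted(treated.keys() & control.keys())
    raw = {
        s: math.log2(max(treated[s], floor) / max(control[s], floor))
        for s in spots
    }
    center = statistics.median(raw.values())
    return {s: r - center for s, r in raw.items()}


# Hypothetical channel intensities: the treated channel is uniformly twice
# as bright (a labeling or scanning bias), and only spot "c" truly changes.
treated = {"a": 200.0, "b": 400.0, "c": 1600.0}
control = {"a": 100.0, "b": 200.0, "c": 200.0}
print(normalized_log_ratios(treated, control))
```

Centering removes the uniform channel bias, so only the spot with a genuine difference retains a nonzero log ratio.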

cDNA derived from control and treated
populations of RNA is most commonly
hybridized to arrays, although subtractive
hybridization or differential display reactions
may also be used. Fluorophore- or radiola-
beled nucleotides are directly incorporated
into the cDNA in the process of converting
RNA to cDNA. Alternatively, 5' end-labeled
primers may be used for cDNA synthesis.
These are labeled with a fluorophore for
direct visualization of the hybridized array.
Alternatively, biotin or a hapten may be
attached to the primer, in which case fluor-
labeled streptavidin or antibody must be
applied before a signal can be generated. The
most commonly used fluorophores at present
are cyanine (Cy)3 and Cy5 (Amersham
Pharmacia Biotech AB, Uppsala, Sweden).
However, the relative expense of these fluo-
rescent conjugates has driven a search for
cheaper alternatives. Fluorescein, rhodamine,
and Texas red have all been used, and
companies such as Molecular Probes, Inc.
(Eugene, OR) are developing a series of
labeled nucleotides with a wide range of exci-
tation and emission spectra which may prove
to function as well as the Cy dyes.

Table 1. Advantages and disadvantages of different microarray scanning systems

Analysis of DNA Microarrays

Membrane-based arrays are normally analyzed
on film or with a phosphorimager, whereas
chip-based arrays require more specialized scan-
ning devices. These can be divided into three
main groups: the charge-coupled device camera
systems, the nonconfocal laser scanners, and the
confocal laser scanners. The advantages and dis-
advantages of each system are listed in Table 1.

Because a typical spot on a microarray can
contain > 10^8 molecules, it is clear that a large
variation in signal strength may occur.
Current scanners cannot work across this
many orders of magnitude (4 or 5 is more typ-
ical). However, the scanning parameters can
normally be adjusted to collect more or less
signal, such that two or three scans of the same
array should permit the detection of rare and
abundant genes.
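
One way to combine two scans of the same array is sketched below, under stated assumptions: the scanner reports 16-bit values (so 65535 marks saturation), the two scans differ by a known, constant gain ratio, and the spot lists are in the same order. None of these details comes from the workshop report; the function and parameter names are hypothetical.

```python
def merge_scans(low_gain, high_gain, gain_ratio, saturation=65535):
    """Combine two scans of one array into a single intensity per spot.

    The high-gain scan is preferred because it resolves rare messages, but
    saturated spots fall back to the low-gain scan. Results are expressed
    on the low-gain scale by dividing high-gain values by `gain_ratio`.
    """
    merged = []
    for low, high in zip(low_gain, high_gain):
        if high < saturation:
            merged.append(high / gain_ratio)
        else:
            merged.append(float(low))
    return merged


# Hypothetical spots: one rare message and one abundant message that
# saturates the detector in the high-gain scan.
low_scan = [10, 5000]
high_scan = [100, 65535]
print(merge_scans(low_scan, high_scan, gain_ratio=10))
```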

When a microarray is scanned, the fluores-
cent images are captured by software normally
included with the scanner. Several commercial
suppliers provide additional software for quan-
tifying array images, but the software tools are
constantly evolving to meet the developing
needs of researchers, and it is prudent to
define one's own needs and clarify the exact
capabilities of the software before its purchase.
Issues that should be considered include the 
following: 

• Can the software locate offset spots? 

• Can it quantitate across irregular hybridiza- 
tion signals? 

• Can the arrayed genes be programmed in for 
easy identification and location? 

• Can the software connect via the Internet to 
databases containing further information on 
the gene(s) of interest? 

One of the key issues raised at the work-
shop was the sensitivity of microarray technol-
ogy. Experiments by General Scanning, Inc.
(Watertown, MA), have shown that by using
the Cy dyes and their scanner, signal can be
detected down to levels of < 1 fluor molecule
per square micrometer, which translates to
detecting a rare message at approximately one
copy per cell or less.

Array Applications 

Although arrays are an emerging technology
certain to undergo improvement and
alteration, they have already been applied use-
fully to a number of model systems. Arrays are
at their most powerful when they contain the
entire genome of the species they are being
used to study. For this reason, they have strong
support among researchers utilizing yeast and
Caenorhabditis elegans (5). The genomes of
both of these species have been sequenced and
in the case of yeast, deposited onto arrays for
examination of gene expression (6,7). With
both of these species, it is relatively easy to
perturb individual gene expression. Indeed, C.
elegans knockouts can be made simply by
soaking the worms in an antisense solution of
the gene to be knocked out.

By a process of systematic gene disrup-
tion, it is now possible to examine the cause
and effect relationships between different
genes in these simple organisms. This kind of
approach should help elucidate biochemical
pathways and genetic control processes,
deconvolute polygenic interactions, and
define the architecture of the cellular network.
A simple case study of how this can be
achieved was presented by Butow [University
of Texas Southwestern Medical Center,
Dallas, TX (Figure 2)]. Although it is the
phenotypic result of a single gene knockout
that is being examined, the effect of such
perturbation will almost always be polygenic.
Polygenic interactions will become increasing-
ly important as researchers begin to move
away from single gene systems when examin-
ing the nature of toxicologic responses to
external stimuli. This is especially important
in toxicology because the phenotype pro-
duced by a given environmental insult is
never the result of the action of a single gene;
rather, it is a complex interaction of one or
multiple cellular pathways. Phenomena such
as quantitative trait (the continuous variation
of phenotype), epistasis (the effect of alleles of
one or more genes on the expression of other
genes), and penetrance (proportion of indi-
viduals of a given genotype that display a par-
ticular phenotype) will become increasingly
evident and important as toxicologists push
toward the ultimate goal of matching the
responses of individuals to different
environmental stimuli.

Analysis of the transcriptome (the expres-
sion level of all the genes in a given cell popula-
tion) was a use of arrays addressed by several
speakers. Unfortunately, current gene nomen-
clature is often confusing in that single genes
are allocated multiple names (usually as a result
of independent discovery by different laborato-
ries), and there was a call for standardization of
gene nomenclature. Nevertheless, once a tran-
scriptome has been assembled it can then be
transferred onto arrays and used to screen any
chosen system. The EPA MicroArray
Consortium (EPAMAC) is assembling testes
transcriptomes for human, rat, and mouse. In a
slightly different approach, Nuwaysir et al. (8)
describe how the NIEHS assembled what is
effectively a "toxicological transcriptome," a
library of human and mouse genes that have
previously been proven or implicated in
responses to toxicologic insults. Clontech
Laboratories, Inc. (Palo Alto, CA), has begun a
similar process by developing stress/toxicology
filter arrays of rat, mouse, and human genes.
Thus, rather than being tissue or cell specific,
these stress/toxicology arrays can be used across
a variety of model systems to look for alter-
ations in the expression of toxicologically
important genes and define the new field of
toxicogenomics. The potential to identify toxi-
cant families based on tissue- or cell-specific
gene expression could revolutionize drug test-
ing. These molecular signatures or fingerprints
could not only point to the possible
toxicity/carcinogenicity of newly discovered
compounds (Figure 3), but also aid in elucidat-
ing their mechanism of action through identifi-
cation of gene expression networks. By exten-
sion, such signatures could provide easily iden-
tifiable biomarkers to assess the degree, time,
and nature of exposure.
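
Matching a test compound against known signatures can be illustrated with a minimal sketch. It assumes each signature is a vector of log expression ratios over the same ordered set of genes and uses Pearson correlation with an arbitrary cutoff of 0.8; the report does not specify a similarity measure, and all names and values here are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a flat profile carries no pattern to compare
    return cov / (sx * sy)


def best_match(test_profile, signatures, min_r=0.8):
    """Return (family, r) for the closest signature, or (None, r) if the
    best correlation falls below the assumed cutoff `min_r`."""
    scores = {name: pearson(test_profile, sig) for name, sig in signatures.items()}
    name = max(scores, key=scores.get)
    r = scores[name]
    return (name, r) if r >= min_r else (None, r)


# Hypothetical log-ratio signatures over the same four genes.
signatures = {
    "peroxisome proliferators": [2.0, 1.5, -1.0, 0.0],
    "PAHs": [-1.0, 0.0, 2.0, 1.0],
}
print(best_match([1.8, 1.6, -0.9, 0.1], signatures))
print(best_match([0.0, 1.0, 0.0, -1.0], signatures))
```

The first test profile is assigned to a known family, while the second matches none of the stored signatures.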

DNA arrays are primarily a tool for exam-
ining differential gene expression in a given
model. In this context they are referred to as
closed systems because they lack the ability of
other differential expression technologies, e.g.,
differential display and subtractive hybridiza-
tion, to detect previously unknown genes not
present on the array. This would appear to
limit the power of DNA arrays to the imagina-
tions and preconceptions of the researcher in
selecting genes previously characterized and
thought to be involved in the model system.
However, the various genome sequencing pro-
jects have created a new category of
sequence, the EST, that has partially molli-
fied this deficiency. ESTs are cDNAs expressed
in a given tissue that, although they may share
some degree of sequence similarity to previous-
ly characterized genes, have not been assigned
specific genetic identity. By incorporating EST
clones into an array, it is possible to monitor
the expression of these unknown genes. This
can enable the identification of previously
uncharacterized genes that may have biologic
significance in the model system. Filter arrays
from Research Genetics and slide arrays from
Incyte Pharmaceuticals both incorporate large
numbers of ESTs from a variety of species.

A further use of microarrays is the identifi-
cation of single nucleotide polymorphisms
(SNPs). These genomic variations are abun-
dant (they occur approximately every 1 kb or
so) and are the basis of restriction fragment
length polymorphism analysis used in forensic
analysis. Affymetrix, Inc. designed chips that
contain multiple repeats of the same gene
sequence. Each position is present with all four
possible bases. After the hybridization of the
sample, the degree of hybridization to the dif-
ferent sequences can be measured and the exact
sequence of the target gene deduced. SNPs are
thought to be of vital importance in drug
metabolism and toxicology. For example, sin-
gle base differences in the regulatory region or
active site of some genes can account for huge
differences in the activity of that gene. Such
SNPs are thought to explain why some people
are able to metabolize certain xenobiotics bet-
ter than others. Thus, arrays provide a further
tool for the toxicologist investigating the
nature of susceptible subpopulations and toxi-
cologic response.
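
The sequence deduction described above, in which each position is probed with all four possible bases, can be sketched as a simple base-calling rule. This is an illustration only, not the vendor's algorithm: the data layout, the function name, and the 1.5-fold margin between the strongest and second-strongest probe are assumptions.

```python
def call_bases(position_signals, min_ratio=1.5):
    """Deduce a sequence from per-position hybridization signals.

    `position_signals` is a list with one dict per sequence position,
    mapping each candidate base ("A", "C", "G", "T") to its signal.
    A position is called "N" when no probe clearly dominates.
    """
    calls = []
    for probes in position_signals:
        ranked = sorted(probes.items(), key=lambda kv: kv[1], reverse=True)
        (best_base, top), (_, second) = ranked[0], ranked[1]
        if top <= 0:
            calls.append("N")  # no signal at this position
        elif second <= 0 or top / second >= min_ratio:
            calls.append(best_base)
        else:
            calls.append("N")  # ambiguous: two probes hybridize similarly
    return "".join(calls)


# Hypothetical signals for three positions of a target gene.
signals = [
    {"A": 900, "C": 50, "G": 40, "T": 60},
    {"A": 100, "C": 110, "G": 90, "T": 95},
    {"A": 20, "C": 30, "G": 800, "T": 25},
]
print(call_bases(signals))
```

Comparing the called sequence with a reference sequence would then flag candidate single-base differences.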

There are still many wrinkles to be ironed 
out before arrays become a standard tool for 
toxicologists. The main issues raised at the 
workshop by those with hands-on experience 
were the following: 

• Expense: the cost of purchasing/contracting 
this technology is still too great for many 
individual laboratories. 

Figure 2. Potential effects of gene knockout within
positively and negatively regulated gene expression
networks. [...] is limiting in wild type for expression of
[...]. (A) A simple, two-component, linear regulatory
network operating on gene [...], where [...] is a positive
effector of [...] and j_n is either a positive or negative
effector of [...]. This network could be deduced by
examining the consequence of (B) deleting j_n on the
expression of [...] and [...], where the expression of [...]
would be decreased or increased depending on
whether j_n was a positive or negative regulator.
These and other connected components of even
greater complexity could be revealed by genome-
wide expression analysis. From Butow (15).



• Clones: the logistics of identifying, obtaining,
and maintaining a set of nonredundant, non-
contaminated, sequence-verified, species/cell/
tissue/field-specific clones.

• Use of inbred strains: where whole-organism
models are being used, the use of inbred
strains is important to reduce the potentially
confusing effects of the individual variation
typically seen in outbred populations.

• Probe: the need for relatively large amounts
of RNA, which limits the type of sample
(e.g., biopsy) that can be used. Also, different
RNA extraction methods can give different
results.

• Specificity: the ability to discriminate accu-
rately between closely related genes (e.g., the
cytochrome P450 family) and splice variants.

• Quantitation: the quantitation of gene
expression using gene arrays is still open to
debate. One reason for this is the different
incorporation of the labeling dyes. However,
the main difficulty lies in knowing what to
normalize against. One option is to include a
large number of so-called housekeeping genes
in the array. However, the expression of these
genes often changes depending on the tissue
and the toxicant, so it is necessary to charac-
terize the expression of these genes in the
model system before utilizing them. This is
clearly not a viable option when screening
multiple new compounds. A second option
is to include on the array genes from a nonre-
lated species (e.g., a plant gene on an animal
array) and to spike the probe with synthetic
RNA(s) complementary to the gene(s).

• Reproducibility: this is sometimes question-
able, and a figure of approximately two or
three repeats was used as the minimum num-
ber required to confirm initial findings.
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• iensitivirv: concerns werr voiceJ jchvj: 
""number of target moiecuies that mus: be pre- 
sent in a sample for mem ro t>e uetcciec on 
the array. 

• Efficiency: reproducible identification of

to 2-fold differences in expression was reported,
although the number of genes that
undergo this level of change and remain
undetected is open to debate. It is important
that this level of detection be ultimately
achieved because it is commonly perceived
that some important transcription factors
and their regulators respond at such low levels.
In most cases, 3- to 5-fold was the minimum
change that most were happy to
accept.

• Bioinformatics: perhaps the greatest concern
was how to interpret the data with
the greatest accuracy and efficiency. The
biggest headache is trying to identify networks
of gene expression that are common to
different treatments or doses. The amount of
data from a single experiment is huge. It may
be that, in the future, several groups individually
equipped with specialized software algorithms
for studying their favorite genes or
gene systems will be able to share the same
hybridized chips. Thus, arrays could usher in
a new perspective on collaboration and the
sharing of data.
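The second normalization option in the Quantitation item (spiking the probe with synthetic RNA for non-related control genes) can be illustrated with a short sketch. It is only one possible reading of that idea, not a procedure reported at the workshop: the function names, the use of a geometric mean, and all signal values are assumptions made for illustration, and the 3-fold default simply echoes the lower end of the minimum change mentioned under Efficiency.

```python
import math


def spike_in_scale(control_spikes, treated_spikes):
    """Return the factor that rescales treated signals onto the control scale.

    The factor is the geometric mean of control/treated ratios over the
    spiked-in control spots, which should read the same in both samples.
    """
    if not control_spikes or len(control_spikes) != len(treated_spikes):
        raise ValueError("need matching, non-empty spike-in readings")
    log_ratios = [math.log(c / t) for c, t in zip(control_spikes, treated_spikes)]
    return math.exp(sum(log_ratios) / len(log_ratios))


def fold_changes(control, treated, scale):
    """Normalized treated/control ratio for every gene present in both samples."""
    return {g: treated[g] * scale / control[g] for g in control if g in treated}


def flag_changed(ratios, min_fold=3.0):
    """Genes whose expression rose or fell by at least min_fold."""
    return sorted(g for g, r in ratios.items() if r >= min_fold or r <= 1.0 / min_fold)


if __name__ == "__main__":
    # Hypothetical readings: the treated sample labeled half as efficiently.
    scale = spike_in_scale([1000.0, 2000.0], [500.0, 1000.0])
    ratios = fold_changes(
        {"geneA": 100.0, "geneB": 200.0, "geneC": 300.0},
        {"geneA": 200.0, "geneB": 100.0, "geneC": 25.0},
        scale,
    )
    print(scale, ratios, flag_changed(ratios))
```

Normalizing on spiked-in controls avoids having to characterize housekeeping genes in every model system, which is the drawback the text raises for the first option.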

Figure 1. Gene expression profiles, also called fingerprints or signatures, of known toxicants or toxicant families may, in the future, be used to identify the potential toxicity of new drugs, etc. In this example, the genetic signature of test compound 1 is identical to that of known peroxisome proliferators, whereas that of test compound 2 does not match any known toxicant family. Based on these results, test compound 2 would be retained for further testing and test compound 1 would be eliminated.

EPAMAC

Perhaps the main reason most scientists are
unable to use array technology is the high cost
involved, whether buying off-the-shelf membranes,
using contract printing services, or
producing chips in-house. In view of this,
researchers at the RTD/NHEERL initiated
the EPAMAC. This consortium brings
together scientists from the EPA and a number
of extramural labs with the aim of developing
microarray capability through the sharing
of resources and data. EPAMAC
researchers are primarily interested in the
developmental and toxicologic changes seen 
in testicular and breast tissue, and a portion 
of the workshop was set aside for EPAMAC 
members to share their ideas on how the 
experimental application of microarrays could
facilitate their research. One of the central 
areas of interest to EPAMAC members is the 
effect of xenobiotics on male fertility and 
reproductive health. Of greatest concern is 
the effect of exposure during critical periods 
of development and germ cell differentiation 
(9), and how this may compromise sperm
counts and quality following sexual maturation
(10). As well as spermatogenic tissue,
there is also interest in how residual mRNA
found in mature sperm (11) could be used as
an indicator of previous xenobiotic effects (it
is easier to obtain a semen sample than a testicular
biopsy). Arrays will be used to examine
and compare the effect of exposure to heat 
and chemicals in testicular and epididymal 
gene expression profiles, with the aim of 
establishing relationships/associations 
between changes in developmental landmarks
and the effects on sperm count and quality. 
Cluster, pattern, and other analysis of such 
data should help identify hidden relationships 
between genes that may reveal potential 
mechanisms of action and uncover roles for 
genes with unknown functions. 
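The kind of pattern analysis mentioned here, and the signature matching described in the Figure 1 caption, can be sketched as a comparison between a test compound's expression profile and reference signatures. This is a minimal illustration under stated assumptions rather than a method described in the report: Pearson correlation, the 0.9 cutoff, the family name "hypothetical family B", and every number below are invented for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


def best_match(test_profile, signatures, min_r=0.9):
    """Return (family, r) for the closest signature, or (None, r) if none is close."""
    scored = [(pearson(test_profile, sig), name) for name, sig in signatures.items()]
    r, name = max(scored, key=lambda pair: pair[0])
    return (name, r) if r >= min_r else (None, r)


if __name__ == "__main__":
    # Hypothetical log-ratio signatures over the same five genes.
    signatures = {
        "peroxisome proliferators": [2.0, 1.5, -1.0, 0.0, 0.5],
        "hypothetical family B": [-1.0, 0.0, 2.0, 1.0, -0.5],
    }
    print(best_match([1.8, 1.6, -0.9, 0.1, 0.4], signatures))
    print(best_match([0.5, -0.5, 0.5, -0.5, 0.5], signatures))
```

With these made-up values the first profile matches the peroxisome proliferator signature and the second matches no family, mirroring the two test compounds in Figure 1.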

Summary 

The full impact of DNA arrays may not be
seen for several years, but the interest shown at
this regional workshop indicates the high level
of interest that they foster. Apart from educating
and advertising the various technologies in
this field, this workshop brought together a
number of researchers from the Research
Triangle Park area who are already using DNA
arrays. The interest in sharing ideas and experiences
led to the initiation of a Triangle array

Array technology is still in its infancy. This
means that the hardware is still improving and
there is no current consensus for standard procedures,
quantitation, and interpretation.
Consistency in spotting and scanning arrays is
not yet optimized, and this is one of the most
critical requirements of any experiment. In
addition, one of the dark regions of array technology
(strife in the courts over who owns
what portions of it) has further muddled the
future and is a potential barrier toward the
development of consensus procedures.

Perhaps the greatest hurdle for the application
of arrays is the actual interpretation of
data. No specialists in bioinformatics attended
the workshop, largely because they are rare and
because as yet no one seems clear on the best
method of approaching data analysis and interpretation.
Cross-referencing results from multiple
experiments (time, dose, repeats, different
animals, different species) to identify commonly
expressed genes is a great challenge. In most
cases, we are still a long way from understanding
how the expression of gene X is related to
the expression of gene Y, and ordering gene
expression to delineate causal relationships.

To the ordinary scientist in the typical lab- 
oratory, however, the most immediate prob- 
lem is a lack of affordable instrumentation. 
One can purchase premade membranes at 
relatively affordable prices. Although these 
may be useful in identifying individual genes 
to pursue in more detail using other methods, 
the numbers that would be required for even a 
small routine toxicology experiment prohibit 
this as a truly viable approach. For the toxicologist,
there is a need to carry out multiple
experiments: dose responses, time curves,
multiple animals, and repeats. Glass-based
DNA arrays are most attractive in this context 
because they can be prepared in large batches 
from the same DNA source and accommo- 
date control and treated samples on the same 
chip. Another problem with current off-the-shelf
arrays is that they often do not contain
one or more of the particular genes a group is 
interested in. One alternative is to obtain 
and/or produce a set of custom clones and 
have contract printing of membranes or slides 
carried out by a company such as Genomic 
Solutions, Inc. (Ann Arbor, MI). This approach
is less expensive than laying out capital for
one's own entire system, although at some
point it might make economic sense to print
one's own arrays.

Finally, DNA arrays are currently a team
effort. They are a technology that uses a wide
range of skills including engineering, statistics,
molecular biology, chemistry, and bioinformatics.
Because most individuals are skilled in
only one or perhaps two of these areas, it
appears that success with arrays may be best
expected by teams of collaborators consisting
of individuals having each of these skills.

Those considering array applications may
be amused or goaded on by the following
quote from Fortune magazine (12):

Microprocessors have reshaped our economy,
spawned vast fortunes and changed the way we live.
Gene chips could be even bigger.

Although this comment may have been 
designed to excite the imagination rather than 
accurately reflect the truth, it is fair to say that 
the age of functional genomics is upon us. 
DNA arrays look set to be an important tool in 
this new age of biotechnology and will likely 
contribute answers to some of toxicology's
most fundamental questions.

Abstract

Recent progress in genomics and proteomics technologies has created a unique opportunity to significantly impact
the pharmaceutical drug development processes. The perception that cells and whole organisms express specific
inducible responses to stimuli such as drug treatment implies that unique expression patterns, molecular fingerprints,
indicative of a drug's efficacy and potential toxicity are accessible. The integration into state-of-the-art toxicology of
assays allowing one to profile treatment-related changes in gene expression patterns promises new insights into
mechanisms of drug action and toxicity. The benefits will be improved lead selection, and optimized monitoring of
drug efficacy and safety in pre-clinical and clinical studies based on biologically relevant tissue and surrogate markers.
© 2000 Elsevier Science Ireland Ltd. All rights reserved.

Keywords: Proteomics; Genomics; Toxicology



1. Introduction 

The majority of drugs act by binding to protein
targets, most to known proteins representing enzymes,
receptors and channels, resulting in effects
such as enzyme inhibition and impairment of
signal transduction. The treatment-induced perturbations
provoke feedback reactions aiming to
compensate for the stimulus, which almost always
are associated with signals to the nucleus, resulting
in altered gene expression. Such gene expression
regulations account for both the
pharmacological action and the toxicity of a drug
and can be visualized by either global mRNA or
global protein expression profiling. Hence, for
each individual drug, a characteristic gene regulation
pattern, its molecular fingerprint, exists
which bears valuable information on its mode of
action and its mechanism of toxicity.

Gene expression is a multistep process that 
results in an active protein (Fig. 1). There exist 
numerous regulation systems that exert control at 
and after the transcription and the translation 
step. Genomics, by definition, encompasses the 
quantitative analysis of transcripts at the mRNA 
level, while the aim of proteomics is to quantify
gene expression further downstream, creating a
snapshot of gene regulation closer to ultimate cell
function control.

2. Global mRNA profiling

Expression data at the mRNA level can be
produced using a set of different technologies
such as DNA microarrays, reverse transcript
imaging, amplified fragment length polymorphism
(AFLP), serial analysis of gene expression
(SAGE) and others. Currently, DNA microarrays
are very popular and promise great potential.
On a typical array, each gene of interest is represented
either by a long DNA fragment (200-2400
bp) typically generated by polymerase chain reaction
(PCR) and spotted on a suitable substrate
using robotics (Schena et al., 1995; Shalon et al.,
1996) or by several short oligonucleotides (20-30
bp) synthesized directly onto a solid support using
photolabile nucleotide chemistry (Fodor et al.,
1991; Chee et al., 1996). From control and treated
tissues, total RNA or mRNA is isolated and
reverse transcribed in the presence of radioactive
or fluorescent labeled nucleotides, and the labeled
probes are then hybridized to the arrays. The
intensity of the array signal is measured for each
gene transcript by either autoradiography or laser
scanning confocal microscopy. The ratio between
the signals of control and treated samples reflects
the relative drug-induced change in transcript
abundance.
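The ratio described in the last sentence is often expressed on a log2 scale, so that a doubling and a halving are symmetric around zero. The sketch below is illustrative only: the log2 convention, the signal floor used to avoid dividing by near-zero background, and the gene names and intensities are assumptions, not details given in the text.

```python
import math


def log2_ratios(control, treated, floor=1.0):
    """Per-gene log2(treated / control) for genes measured in both samples.

    Signals below `floor` are raised to `floor` so that spots near
    background do not produce extreme or undefined ratios.
    """
    ratios = {}
    for gene in control.keys() & treated.keys():
        c = max(control[gene], floor)
        t = max(treated[gene], floor)
        ratios[gene] = math.log2(t / c)
    return ratios


if __name__ == "__main__":
    # Hypothetical array intensities for three transcripts.
    control = {"cyp4a1": 100.0, "alb": 4000.0, "actb": 800.0}
    treated = {"cyp4a1": 400.0, "alb": 1000.0, "actb": 800.0}
    for gene, value in sorted(log2_ratios(control, treated).items()):
        print(gene, round(value, 2))
```

A value of +2 then means a 4-fold increase after treatment, -2 a 4-fold decrease, and 0 no change.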




3. Global protein profiling 

Global quantitative expression analysis at the
protein level is currently restricted to the use of
two-dimensional gel electrophoresis. This technique
combines separation of tissue proteins by
isoelectric focusing in the first dimension and by
sodium dodecyl sulfate slab gel electrophoresis-based
molecular weight separation in the second,
orthogonal dimension (Anderson et al., 1991).
The product is a rectangular pattern of protein
spots that are typically revealed by Coomassie
Blue, silver or fluorescent staining (Fig.).
Protein spots are identified by mass spectrometry
following generation of peptide mass fingerprints
(Mann et al., 1993) and sequence tags (Wilkins et
al., 1996). Similar to the mRNA approach, the
optical densities of spots from
control and treated samples are compared to
search for treatment-related changes.



4. Expression data analysis 

Bioinformatics forms a key element required to
organize, analyze and store expression data from
either source, the mRNA or the protein level. The
overall objective, once a mass of high-quality
quantitative expression data has been collected, is
to visualize complex patterns of gene expression
changes, to detect pathways and sets of genes
tightly correlated with treatment efficacy and toxicity
and to compare the effects of different sets of
treatment (Anderson et al., 1996). As the drug
effect database is growing, one may detect similarities
and differences between the molecular fingerprints
produced by various drugs, information
that may be crucial to make a decision whether to
refocus or extend the therapeutic spectrum of a
drug candidate.



5. Comparison of global mRNA and protein 
expression profiling 

There are several synergies and overlaps of data
obtained by mRNA and protein expression analysis.
Low-abundance transcripts may not be easily
quantified at the protein level using standard two-dimensional
gel electrophoresis analysis and their
detection may require prefractionation of samples.
The expression of such genes may be preferably
quantified at the mRNA level using
techniques allowing PCR-mediated target amplification.
Tissue biopsy samples typically yield good
quality of both mRNA and proteins; however, the
quality of mRNA isolated from body fluids is
often poor due to the faster degradation of
mRNA when compared with proteins. RNA samples
from body fluids such as serum or urine are
often not very meaningful, and secreted proteins
are likely more reliable surrogate markers for
treatment efficacy and safety. Detection of post-translational
modifications, events often related to
function or nonfunction of a protein, is restricted 
to protein expression analysis and rarely can be 
predicted by mRNA profiling. Information on 
subcellular localization and translocation of 
proteins has to be acquired at the level of the 
protein in combination with sample prefractiona- 
tion procedures. The growing evidence of a poor 
correlation between mRNA and protein abundance
(Anderson and Seilhamer, 1997) further
suggests that the two approaches, mRNA and
protein profiling, are complementary and should
be applied in parallel. 
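One way to examine how well mRNA and protein abundances agree for the same set of genes is a rank correlation, which does not depend on the two measurements sharing a scale. The sketch below uses Spearman's coefficient as an assumption of this illustration; it is not necessarily the statistic used in the cited study, and the abundance values are invented.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical abundances for the same five genes.
    mrna = [5.0, 1.0, 3.0, 2.0, 4.0]
    protein = [50.0, 20.0, 10.0, 40.0, 30.0]
    print(spearman(mrna, protein))
```

A coefficient well below 1 for such paired data would be consistent with the poor correlation the text describes, and with the argument for applying both approaches in parallel.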

6. Expression profiling and drug development 

Understanding the mechanisms of action and 
toxicity, and being able to monitor treatment 
efficacy and safety during trials is crucial for the 
successful development of a drug. Mechanistic 
insights are essential for the interpretation of drug 
effects and enhance the chances of recognizing 
potential species specificities contributing to an 
improved risk profile in humans (Richardson et
al., 1993; Steiner et al., 1996b; Aicher et al., 1998).
The value of expression profiling further increases 
when links between treatment-induced expression 
profiles and specific pharmacological and toxic 
endpoints are established (Anderson et al., 1991,
1995, 1996; Steiner et al., 1996a). Changes in gene
expression are known to precede the manifesta- 
tion of morphological alterations, giving expres- 
sion profiling a great potential for early 
compound screening, enabling one to select drug 
candidates with wide therapeutic windows 
reflected by molecular fingerprints indicative of 
high pharmacological potency and low toxicity 
(Arce et al., 1998). In later phases of drug development,
surrogate markers of treatment efficacy
and toxicity can be applied to optimize the monitoring
of pre-clinical and clinical studies (Doherty
et al., 1998).



7. Perspectives 

The basic methodology of safety evaluation has
changed little during the past decades. Toxicity in
laboratory animals has been evaluated primarily
by using hematological, clinical chemistry and
histological parameters as indicators of organ
damage. The rapid progress in genomics and proteomics
technologies creates a unique opportunity
to dramatically improve the predictive power of
safety assessment and to accelerate the drug development
process. Application of gene and protein
expression profiling promises to improve lead selection,
resulting in the development of drug candidates
with higher efficacy and lower toxicity.
The identification of biologically relevant surrogate
markers correlated with treatment efficacy
and safety bears a great potential to optimize the
monitoring of pre-clinical and clinical trials.



References 

Aicher, L., Wahl, D., Arce, A., Grenet, O., Steiner, S., 1998.
New insights into cyclosporine A nephrotoxicity by proteome
analysis. Electrophoresis 19, 1998-2003.

Anderson, N.L., Seilhamer, J., 1997. A comparison of selected
mRNA and protein abundances in human liver.
Electrophoresis 18, 533-537.

Anderson, N.L., Esquer-Blasco, R., Hofmann, J.P., Anderson,
N.G., 1991. A two-dimensional gel database of rat liver
proteins useful in gene regulation and drug effects studies.
Electrophoresis 12, 907-930.

Anderson, L., Steele, V.E., Kelloff, G.J., Sharma, S., 1995.
Effects of oltipraz and related chemoprevention compounds
on gene expression in rat liver. J. Cell. Biochem.
Suppl. 22, 108-116.

Anderson, N.L., Esquer-Blasco, R., Richardson, F., Foxworthy,
P., Eacho, P., 1996. The effects of peroxisome proliferators
on protein abundances in mouse liver. Toxicol. Appl.
Pharmacol. 137, 75-89.

Arce, A., Aicher, L., Wahl, D., Esquer-Blasco, R., Anderson,
N.L., Cordier, A., Steiner, S., 1998. Changes in the liver
proteome of female Wistar rats treated with the hypoglycemic
agent SDZ PGU 693. Life Sci. 63, 2243-2250.

Chee, M., Yang, R., Hubbell, E., Berno, A., Huang, X.C.,
Stern, D., Winkler, J., Lockhart, D.J., Morris, M.S.,
Fodor, S.P., 1996. Accessing genetic information with
high-density DNA arrays. Science 274, 610-614.

Doherty, N.S., Littman, B.H., Reilly, K., Swindell, A.C., Buss,
J.M., Anderson, N.L., 1998. Analysis of changes in acute-phase
plasma proteins in an acute inflammatory response
and in rheumatoid arthritis using two-dimensional gel
electrophoresis. Electrophoresis 19, 355-363.

Fodor, S.P., Read, J.L., Pirrung, M.C., Stryer, L., Lu, A.T.,
Solas, D., 1991. Light-directed, spatially addressable parallel
chemical synthesis. Science 251, 767-773.

Mann, M., Hojrup, P., Roepstorff, P., 1993. Use of mass
spectrometric molecular weight information to identify
proteins in sequence databases. Biol. Mass Spectrom. 22,
338-345.

Richardson, F.C., Strom, S.C., Copple, D.M., Bendele, R.A.,
Probst, G.S., Anderson, N.L., 1993. Comparison of
protein changes in human and rodent hepatocytes induced
by the rat-specific carcinogen, methapyrilene.
Electrophoresis 14, 157-161.

Schena, M., Shalon, D., Davis, R.W., Brown, P.O., 1995.
Quantitative monitoring of gene expression patterns with
a complementary DNA microarray. Science 270, 467-470.

Shalon, D., Smith, S.J., Brown, P.O., 1996. A DNA microarray
system for analyzing complex DNA samples using
two-color fluorescent probe hybridization. Genome Res. 6,
639-645.

Steiner, S., Wahl, D., Mangold, B.L., Robison, R., Raymackers,
J., Meheus, L., Anderson, N.L., Cordier, A.,
1996a. Induction of the adipose differentiation-related
protein in liver of etomoxir treated rats. Biochem. Biophys.
Res. Commun. 218, 777-782.

Steiner, S., Aicher, L., Raymackers, J., Meheus, L., Esquer-Blasco,
R., Anderson, L., Cordier, A., 1996b. Cyclosporine
A mediated decrease in the rat renal calcium binding
protein calbindin-D 28 kDa. Biochem. Pharmacol. 51,
253-258.

Wilkins, M.R., Gasteiger, E., Sanchez, J.C., Appel, R.D.,
Hochstrasser, D.F., 1996. Protein identification with sequence
tags. Curr. Biol. 6, 1543-1544.






Biochemistry: Brenner et al 

likely, its power can be attributed to its incorporation of more 
information than any other measure; it takes account of the 
full substitution and gap data (like raw scores) but also has 
details about the sequence lengths and composition and is 
scaled appropriately. 

We find that statistical scores are not only powerful, but also
easy to interpret. SSEARCH and FASTA show close agreement
between statistical scores and actual number of errors per
query (Fig. 4). The expectation value score gives a good,
slightly conservative estimate of the chances of the two sequences
being found at random in a given query. Thus, an
E-value of 0.01 indicates that roughly one pair of nonhomologs
of this similarity should be found in every 100 different queries.
Neither raw scores nor percentage identity can be interpreted
in this way, and these results validate the suitability of the
extreme value distribution for describing the scores from a
database search.
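The interpretation of an E-value given above amounts to simple arithmetic, sketched below. The first function restates the text's example directly. The second uses the usual Poisson relationship between E-values and P-values, P = 1 - exp(-E), which is standard in database-search statistics but is brought in here as background rather than taken from this section; the function names are hypothetical.

```python
import math


def expected_false_pairs(e_value, n_queries):
    """Expected number of chance matches at this E-value over n_queries searches."""
    return e_value * n_queries


def p_value_from_e_value(e_value):
    """Probability of at least one chance match in a single query (Poisson model)."""
    return -math.expm1(-e_value)  # equals 1 - exp(-e_value), accurate for small E


if __name__ == "__main__":
    print(expected_false_pairs(0.01, 100))
    print(p_value_from_e_value(0.01))
    print(p_value_from_e_value(5.0))
```

For small values the two quantities nearly coincide, which is why E-values and P-values are comparable near the thresholds discussed here; they diverge once E approaches or exceeds 1.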

The P-values from BLAST also should be directly interpretable
but were found to overstate significance by more than two
orders of magnitude for 1% EPQ for this database. Nonetheless,
these results strongly suggest that the analytic theory is
fundamentally appropriate. WU-BLAST2 scores were more reliable
than those from BLAST, but also exaggerate expected
confidence by more than an order of magnitude at 1% EPQ.

Overall Detection of Homologs and Comparison of Algorithms.
The results in Fig. 5A and Table 1 show that pairwise
sequence comparison is capable of identifying only a small
fraction of the homologous pairs of sequences in PDB40D-B.
Even SSEARCH with E-values, the best protocol tested, could
find only 18% of all relationships at a 1% EPQ. BLAST, which
identifies 15%, was the worst performer, whereas FASTA
ktup = 1 is nearly as effective as SSEARCH. FASTA ktup = 2 and
WU-BLAST2 are intermediate in their ability to detect homologs.
Comparison of different algorithms indicates that
those capable of identifying more homologs are generally
slower. SSEARCH is 25 times slower than BLAST and 6.5 times
slower than FASTA ktup = 1. WU-BLAST2 is slightly faster than
FASTA ktup = 2, but the latter has more interpretable scores.

In PDB90D-B, where there are many close relationships, the
best method can identify only 38% of structurally known
homologs (Fig. 5B). The method which finds that many
relationships is WU-BLAST2. Consequently, we infer that the
differences between FASTA ktup = 1, SSEARCH, and WU-BLAST2
programs are unlikely to be significant when compared with
variation in database composition and scoring reliability.
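The coverage figures quoted at 1% EPQ can be made concrete with a short sketch. It assumes, as a reading of the text rather than a definition stated in this section, that errors per query (EPQ) is the number of nonhomologous pairs accepted divided by the number of queries, and that coverage is the fraction of all known homologous pairs accepted at the most permissive cutoff that keeps EPQ within the limit. The function name and the toy hit list are hypothetical, and tied scores are not treated specially.

```python
def coverage_at_epq(hits, total_homolog_pairs, n_queries, max_epq=0.01):
    """Return (coverage, cutoff) at the loosest cutoff with EPQ <= max_epq.

    hits: iterable of (e_value, is_homolog) for every reported pair, where a
    lower e_value means a more significant match.
    """
    errors = 0
    found = 0
    accepted_found = 0
    cutoff = None
    for e_value, is_homolog in sorted(hits, key=lambda hit: hit[0]):
        if is_homolog:
            found += 1
        else:
            errors += 1
        if errors / n_queries > max_epq:
            break  # accepting this pair would exceed the allowed error rate
        accepted_found = found
        cutoff = e_value
    return accepted_found / total_homolog_pairs, cutoff


if __name__ == "__main__":
    # Hypothetical search results: 100 queries, 10 true homologous pairs in total.
    hits = [
        (1e-10, True), (1e-5, True), (1e-3, False),
        (0.01, True), (0.02, False), (0.5, True),
    ]
    print(coverage_at_epq(hits, total_homolog_pairs=10, n_queries=100))
```

Running the same calculation for each program and scoring scheme on a common database gives directly comparable coverage values of the kind summarized in Table 1.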

Fig. 6 helps to explain why most distant homologs cannot be
found by sequence comparison: a great many such relationships
have no more sequence identity than would be expected
by chance. SSEARCH with E-values can recognize >90% of the
homologous pairs with 30-40% identity. In this region, there
are 30 pairs of homologous proteins that do not have significant
E-values, but 26 of these involve sequences with <50
residues. Of sequences having 25-30% identity, 75% are
identified by SSEARCH E-values. However, although the number
of homologs grows at lower levels of identity, the detection
falls off sharply: only 40% of homologs with 20-25% identity
are detected and only 10% of those with 15-20% can be found.
These results show that statistical scores can find related
proteins whose identity is remarkably low; however, the power
of the method is restricted by the great divergence of many
protein sequences.

Fig. 6. Distribution and detection of homologs in PDB40D-B. Bars
show the distribution of homologous pairs in PDB40D-B according to their
identity (using the measure of identity in both). Filled regions indicate
the number of these pairs found by the best database searching method
(SSEARCH with E-values) at 1% EPQ. The PDB40D-B database contains
proteins with <40% identity, and as shown on this graph, most
structurally identified homologs in the database have diverged extremely
far in sequence and have <20% identity. Note that the
alignments may be inaccurate, especially at low levels of identity. Filled
regions show that SSEARCH can identify most relationships that have
25% or more identity, but its detection wanes sharply below 25%.
Consequently, the great sequence divergence of most structurally
identified evolutionary relationships effectively defeats the ability of
pairwise sequence comparison to detect them.

After completion of this work, a new version of pairwise
BLAST was released: BLASTGP (37). It supports gapped alignments,
like WU-BLAST2, and dispenses with sum statistics. Our
initial tests on BLASTGP using default parameters show that its
E-values are reliable and that its overall detection of homologs
was substantially better than that of ungapped BLAST, but not
quite equal to that of WU-BLAST2.

CONCLUSION 

The general consensus amongst experts (see refs. 7, 24, 25, 27 
and references therein) suggests that the most effective se- 
quence searches are made by (i) using a large current database 
in which the protein sequences have been complexity masked 
and (ii) using statistical scores to interpret the results. Our 
experiments fully support this view. 

Our results also suggest two further points. First, the E-values
reported by FASTA and SSEARCH give fairly accurate
estimates of the significance of each match, but the P-values
provided by BLAST and WU-BLAST2 underestimate the true
extent of errors. Second, SSEARCH, WU-BLAST2, and FASTA
ktup = 1 perform best, though BLAST and FASTA ktup = 2
detect most of the relationships found by the best procedures
and are appropriate for rapid initial searches.

Table 1. Summary of sequence comparison methods with PDB40D-B

Method                                  Relative time*  1% EPQ cutoff      Coverage at 1% EPQ (%)
SSEARCH % identity: within alignment    25.5            >70%               <0.1
SSEARCH % identity: within both         25.5            34%                3.0
SSEARCH % identity: HSSP-scaled         25.5            35% (HSSP + 9.8)   4.0
SSEARCH Smith-Waterman raw scores       25.5            142                10.5
SSEARCH E-values                        25.5            0.03               18.4
FASTA ktup = 1 E-values                 3.9             0.03               17.9
FASTA ktup = 2 E-values                 1.4             0.03               16.7
WU-BLAST2 P-values                      1.1             0.003              17.5
BLAST P-values                          1.0             0.00016            14.8

*Times are from large database searches with genome proteins.

The homologous proteins that are found by sequence com- 
parison can be distinguished with high reliability from the huge 
number of unrelated pairs. However, even the best database 
searching procedures tested fail to find the large majority of 
distant evolutionary relationships at an acceptable error rate. 
Thus, if the procedures assessed here fail to find a reliable 
match, it does not imply that the sequence is unique; rather, it 
indicates that any relatives it might have are distant ones.** 



** Additional and updated information about this work, including
supplementary figures, may be found at http://sss.stanford.edu/sss/.